as you may know I'm working on different new IOs for Beam.
I have some suggestions that I would like to discuss with you all.
The SDK provides a Sink abstract class. It represents a resource that
can be written using a Write transform:
The Sink creates a Writer. A Writer is an abstract class where we
override the open(), write(), close() methods.
Today, no IOs use Sink: they directly use a DoFn.
I fully agree that it's very convenient to implement a Sink but it may
appear like non consistent and can "perturb" the users/developers.
It comes me to the second point.
Today, a IO Read apply() method use a source via Read.from(source).
However, if the source API is not required (for instance in the case of
JDBC where we can't really implement getEstimatedSizeBytes() and
splitIntoBundles()), it's possible to directly use a DoFn instead of a
So again, it could appear like non consistent.
Maybe it would make more sense to "force" the usage of Source even if we
don't leverage all Source features (for instance, in the case of JDBC
IO, getEstimatedSizeBytes() will return 0L and splitIntoBundles() will
return a list with a single source).
The same for Sink: even if a Sink can be implemented with DoFn, it would
be more consistent to implement it with Sink (or remove Sink ;)).
3/ Type Converter
Today each IO represent an element in the PCollection as he wants.
For instance, the following pipeline won't compile straight forward:
p.apply(JmsIO.Read()...) // returns a PCollection<JmsRecord>
.apply(JdbcIO.Write()...) // expects PCollection<JdbcDataRecord>
The user will have to "convert" PCollection<JmsRecord> as
Maybe it makes sense to provide a Converter in the IOs and use kind of
schema and canonical format (optionally), for instance based on Avro.
I added this point in the "Technical Vision" while ago, but I think it
would simplify the way of writing pipelines.
Talend - http://www.talend.com