Hi all,

as you may know I'm working on different new IOs for Beam.

I have some suggestions that I would like to discuss with you all.

1/ Sink
The SDK provides a Sink abstract class. It represents a resource that can be written using a Write transform:

  p.apply(Write.to(new MySink()));

The Sink creates a Writer. A Writer is an abstract class where we override the open(), write(), close() methods.
Today, no IOs use Sink: they directly use a DoFn.
I fully agree that it's very convenient to implement a Sink but it may appear like non consistent and can "perturb" the users/developers.

It comes me to the second point.

2/ Source
Today, a IO Read apply() method use a source via Read.from(source).

However, if the source API is not required (for instance in the case of JDBC where we can't really implement getEstimatedSizeBytes() and splitIntoBundles()), it's possible to directly use a DoFn instead of a Source.

So again, it could appear like non consistent.

Maybe it would make more sense to "force" the usage of Source even if we don't leverage all Source features (for instance, in the case of JDBC IO, getEstimatedSizeBytes() will return 0L and splitIntoBundles() will return a list with a single source). The same for Sink: even if a Sink can be implemented with DoFn, it would be more consistent to implement it with Sink (or remove Sink ;)).

3/ Type Converter
Today each IO represent an element in the PCollection as he wants.
For instance, the following pipeline won't compile straight forward:

p.apply(JmsIO.Read()...) // returns a PCollection<JmsRecord>
 .apply(JdbcIO.Write()...) // expects PCollection<JdbcDataRecord>

The user will have to "convert" PCollection<JmsRecord> as PCollection<JdbcDataRecord>.

Maybe it makes sense to provide a Converter in the IOs and use kind of schema and canonical format (optionally), for instance based on Avro. I added this point in the "Technical Vision" while ago, but I think it would simplify the way of writing pipelines.

Thoughts ?

Regards
JB
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Reply via email to