Beam IO: suggestions and new features

Jean-Baptiste Onofré Fri, 16 Sep 2016 07:19:46 -0700

Hi all,

As you may know, I'm working on several new IOs for Beam.


I have some suggestions that I would like to discuss with you all.

1/ Sink

The SDK provides a Sink abstract class. It represents a resource that can be written using a Write transform:


  p.apply(Write.to(new MySink()));

The Sink creates a Writer. A Writer is an abstract class where we override the open(), write(), and close() methods.
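To illustrate the open()/write()/close() lifecycle, here is a rough, self-contained sketch. Note that Sink and Writer below are simplified, hypothetical stand-ins I wrote for this mail, not the actual SDK classes (the real ones have more methods and different signatures); InMemorySink and writeBundle() are also just illustrative names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WriterSketch {

    // Hypothetical, simplified stand-in for the SDK's Writer (NOT the real class).
    abstract static class Writer<T> {
        abstract void open() throws Exception;
        abstract void write(T element) throws Exception;
        abstract void close() throws Exception;
    }

    // Hypothetical stand-in for Sink: here its only job is to create a Writer.
    abstract static class Sink<T> {
        abstract Writer<T> createWriter();
    }

    // Toy sink that records every call in an in-memory list.
    static class InMemorySink extends Sink<String> {
        final List<String> log = new ArrayList<>();

        @Override
        Writer<String> createWriter() {
            return new Writer<String>() {
                @Override void open() { log.add("open"); }
                @Override void write(String element) { log.add("write:" + element); }
                @Override void close() { log.add("close"); }
            };
        }
    }

    // Roughly what a Write transform would do for one bundle of elements.
    static <T> void writeBundle(Sink<T> sink, List<T> bundle) throws Exception {
        Writer<T> writer = sink.createWriter();
        writer.open();
        try {
            for (T element : bundle) {
                writer.write(element);
            }
        } finally {
            // Always release the resource, even if a write fails.
            writer.close();
        }
    }

    public static void main(String[] args) throws Exception {
        InMemorySink sink = new InMemorySink();
        writeBundle(sink, Arrays.asList("a", "b"));
        System.out.println(sink.log); // prints [open, write:a, write:b, close]
    }
}
```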

Today, no IOs use Sink: they directly use a DoFn.

I fully agree that it's very convenient to implement a sink that way, but it may appear inconsistent and can confuse users and developers.


That brings me to the second point.

2/ Source
Today, an IO's Read apply() method uses a source via Read.from(source).

However, if the Source API is not required (for instance in the case of JDBC, where we can't really implement getEstimatedSizeBytes() and splitIntoBundles()), it's possible to directly use a DoFn instead of a Source.


So again, it could appear inconsistent.

Maybe it would make more sense to "force" the usage of Source even if we don't leverage all Source features (for instance, in the case of JdbcIO, getEstimatedSizeBytes() will return 0L and splitIntoBundles() will return a list with a single source).

The same goes for Sink: even if a sink can be implemented with a DoFn, it would be more consistent to implement it with Sink (or remove Sink ;)).
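As a rough sketch of the "degenerate" source described above: the Source class here is a simplified, hypothetical stand-in (the real SDK methods take additional arguments such as pipeline options), and JdbcLikeSource is just an illustrative name, not the actual JdbcIO code.

```java
import java.util.Collections;
import java.util.List;

public class SingleSplitSourceSketch {

    // Hypothetical, simplified stand-in for the SDK's bounded source;
    // the real signatures differ.
    abstract static class Source<T> {
        abstract long getEstimatedSizeBytes();
        abstract List<? extends Source<T>> splitIntoBundles(long desiredBundleSizeBytes);
    }

    // A source that cannot estimate its size and cannot be split.
    static class JdbcLikeSource extends Source<String> {
        final String query;

        JdbcLikeSource(String query) {
            this.query = query;
        }

        @Override
        long getEstimatedSizeBytes() {
            // No cheap way to know the result size up front.
            return 0L;
        }

        @Override
        List<? extends Source<String>> splitIntoBundles(long desiredBundleSizeBytes) {
            // Ignore the desired size: the whole query is a single bundle.
            return Collections.singletonList(this);
        }
    }

    public static void main(String[] args) {
        JdbcLikeSource source = new JdbcLikeSource("SELECT * FROM t");
        System.out.println(source.getEstimatedSizeBytes());         // prints 0
        System.out.println(source.splitIntoBundles(1024L).size());  // prints 1
    }
}
```

The trade-off is that such a source is read by a single worker, but the IO stays consistent with the others from the user's point of view.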


3/ Type Converter
Today each IO represents an element in the PCollection however it wants.
For instance, the following pipeline won't compile as written:

p.apply(JmsIO.Read()...) // returns a PCollection<JmsRecord>
 .apply(JdbcIO.Write()...) // expects PCollection<JdbcDataRecord>

The user will have to "convert" the PCollection<JmsRecord> to a PCollection<JdbcDataRecord>.

Maybe it makes sense to provide a Converter in the IOs and (optionally) use some kind of schema and canonical format, for instance based on Avro. I added this point to the "Technical Vision" a while ago, and I think it would simplify the way pipelines are written.
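Just to give an idea of what I mean, here is a minimal sketch of converting through a canonical form. Everything in it is an assumption for illustration: the JmsRecord and JdbcDataRecord classes are toy stand-ins with made-up fields, and a plain Map plays the role of the canonical format (where something like an Avro record could be used instead).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class ConverterSketch {

    // Toy stand-in for the record type produced by a JMS read (fields are made up).
    static class JmsRecord {
        final String messageId;
        final String payload;

        JmsRecord(String messageId, String payload) {
            this.messageId = messageId;
            this.payload = payload;
        }
    }

    // Toy stand-in for the record type expected by a JDBC write (fields are made up).
    static class JdbcDataRecord {
        final Map<String, Object> columns;

        JdbcDataRecord(Map<String, Object> columns) {
            this.columns = columns;
        }
    }

    // Each IO would only need to know how to go to/from the canonical form.
    static final Function<JmsRecord, Map<String, Object>> JMS_TO_CANONICAL = record -> {
        Map<String, Object> canonical = new LinkedHashMap<>();
        canonical.put("messageId", record.messageId);
        canonical.put("payload", record.payload);
        return canonical;
    };

    static final Function<Map<String, Object>, JdbcDataRecord> CANONICAL_TO_JDBC =
        JdbcDataRecord::new;

    // Composing the two halves gives the JMS -> JDBC conversion.
    static JdbcDataRecord convert(JmsRecord record) {
        return JMS_TO_CANONICAL.andThen(CANONICAL_TO_JDBC).apply(record);
    }

    public static void main(String[] args) {
        JdbcDataRecord row = convert(new JmsRecord("id-1", "hello"));
        System.out.println(row.columns); // prints {messageId=id-1, payload=hello}
    }
}
```

With N IOs, this needs roughly 2N small converters (to and from the canonical form) rather than one per pair of IOs.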


Thoughts?

Regards
JB
--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
