Thanks, JB, for the detailed notes.

On Fri, Mar 23, 2018 at 2:43 PM Eugene Kirpichov <[email protected]>
wrote:

> Hi! Thanks for the notes.
>
> On Fri, Mar 23, 2018 at 3:07 AM Jean-Baptiste Onofré <[email protected]>
> wrote:
>
>> Hi all,
>>
>> Sorry for the delay, but I had issues with my e-mail provider (I was not
>> able to send e-mails :( ).
>>
>> Last week during Beam Summit, I had the chance to participate in the IO
>> brainstorming session.
>>
>> Here are the meeting notes:
>>
>> 1. IOs set
>> We now have a decent number of IOs in Beam, and new ones are coming
>> (ParquetIO, RabbitMQIO). Users mentioned a new file format we could
>> support: HDF5. It would be a Python IO.
>> I will create the Jira about HDF5.
>> Other IOs are also in preparation, coming along with SDF support.
>>
>
As Eila mentioned, we are talking to the HDF5 group to determine if there's
somebody who's willing to write an HDF5 IO for the Python SDK. I'll be happy
to review it. Looks like Eila created
https://issues.apache.org/jira/browse/BEAM-3850 for this.


>
>> 2. IOs and SDKs
>> This point was related to the portability layer: how can I use a Java IO
>> in Python, or the opposite? Today, most of the IOs are tied to the Java
>> SDK, and it's a bit frustrating for Python SDK users. Users are looking
>> forward to the portability layer; however, they also raised some
>> questions about Docker requirements. I think we should prepare a clear
>> answer to this point.
>>
>
> I'm pretty sure this is on the radar this quarter, but I don't remember
> whose radar.
>

I hope to look into some aspects of this in the next few months. Created
https://issues.apache.org/jira/browse/BEAM-3923 with more info.

>
>
>>
>> 3. PCollection Headers
>> Users want more "dynamic" IOs, for example an IO whose behavior could
>> change depending on the element it is processing in the PCollection. I
>> introduced what we are using in Apache Camel: Message Headers. The Camel
>> component endpoints (equivalent of Beam IOs) can use the headers: for
>> instance, the camel-http component can use a Camel.HTTP_URL header. We
>> already discussed PCollection headers/hints/annotations/metadata
>> (whatever name we give them), and I still think it would be a great
>> feature for both IOs and even the runners.
>> I'm proposing to create a Jira about that; I will be more than happy to
>> work on this one.
>>
>
> Do you have a use case in mind that cannot be solved within the current
> approach to IOs? I think we have a pretty reasonable approach to "dynamic"
> IOs too, exemplified by FileIO.writeDynamic().
>
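
To make the comparison concrete, here is a minimal sketch in plain Python
(not the Beam API) of the per-element routing idea behind
FileIO.writeDynamic(): the destination is computed from the element itself
rather than from a separate header. The names write_dynamic, destination_fn
and format_fn, and the sample records, are hypothetical and only for
illustration.

```python
from collections import defaultdict


def write_dynamic(elements, destination_fn, format_fn):
    """Group formatted elements by a destination computed per element."""
    outputs = defaultdict(list)
    for element in elements:
        # The destination depends only on the element being written.
        outputs[destination_fn(element)].append(format_fn(element))
    return dict(outputs)


# Hypothetical records; the "type" field plays the role a header would play.
events = [
    {"type": "click", "user": "a"},
    {"type": "view", "user": "b"},
    {"type": "click", "user": "c"},
]

files = write_dynamic(
    events,
    destination_fn=lambda e: f"out/{e['type']}.txt",
    format_fn=lambda e: e["user"],
)
```

A header-based design would carry the routing key next to the element
instead of inside it; the grouping logic itself would look the same.
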
>
>>
>> 4. Schema
>> As you might know, we are working on adding schema support in
>> PCollection. This feature can be leveraged by IOs. In particular, I think
>> it would reduce the "wrapping" done by IOs (like KafkaRecord, JmsRecord,
>> ...) and make data conversion easier.
>>
>> 5. Error Handling
>> Users need generic error handling in the IOs. Today, error handling is
>> managed by each IO. I introduced the error handler we are using in Apache
>> Camel (sorry again ;)) and especially the default error handler features
>> like: redelivery policy, recoverable/irrecoverable error handling, onWhen,
>> onException, whileTrue, ...
>> The error handler is not at component level but at routing engine level.
>> We could imagine something similar at pipeline level.
>> Thoughts?
>>
>
> Can you give some example use cases here too?
> I'm sure we can add some useful abstractions related to error handling,
> but picking the right level of abstraction for such an API will require
> very careful design. E.g. something like "a pipeline-global dead-letter
> collection of records that failed processing" sounds useful in theory, but
> I think it is impossible to define in a useful way compatible with the Beam
> model, and I think it has to be left to individual transforms.
>
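
As one possible shape for error handling at the level of an individual
transform, here is a minimal sketch in plain Python (not Beam code). It
retries errors treated as recoverable up to a limit, which is roughly what a
redelivery policy does, and sends everything else to a dead-letter list
returned next to the main output. The names process_with_dead_letter,
max_attempts and recoverable, and the flaky_parse helper, are hypothetical.

```python
def process_with_dead_letter(elements, fn, max_attempts=3,
                             recoverable=(ConnectionError,)):
    """Apply fn to each element, retrying recoverable errors.

    Returns (successes, dead_letters); each dead letter is a tuple of
    (element, exception type name).
    """
    successes, dead_letters = [], []
    for element in elements:
        for attempt in range(1, max_attempts + 1):
            try:
                successes.append(fn(element))
                break
            except recoverable as err:
                # Recoverable: retry until the attempt budget is used up.
                if attempt == max_attempts:
                    dead_letters.append((element, type(err).__name__))
            except Exception as err:
                # Irrecoverable: give up on this element immediately.
                dead_letters.append((element, type(err).__name__))
                break
    return successes, dead_letters


calls = {}


def flaky_parse(value):
    """Hypothetical function: fails once for "flaky", then succeeds."""
    calls[value] = calls.get(value, 0) + 1
    if value == "flaky":
        if calls[value] < 2:
            raise ConnectionError("temporary failure")
        return 0
    return int(value)  # raises ValueError for non-numeric input


ok, dead = process_with_dead_letter(["1", "flaky", "x"], flaky_parse)
```

Here the dead-letter output belongs to this one step, so its element type
and failure semantics are defined by the step rather than by the pipeline.
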
>
>> I hope I didn't forget something ;)
>>
>> To summarize:
>> - I will create new Jiras for HDF5 and other new IOs
>> - We have to work on documentation/explanation about the portability
>> layer & IOs
>> - I will start a separate thread for error handling discussion
>> - Nothing to do about schema: it has already started.
>>
>> Regards
>> JB
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
