[ 
https://issues.apache.org/jira/browse/BEAM-92?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105942#comment-16105942
 ] 

Eugene Kirpichov commented on BEAM-92:
--------------------------------------

I'm not sure whether having the current JIRA still makes sense. The PRs that 
landed on this JIRA are not about a generic data-dependent sink API, they are 
just examples of this idea applied to different IOs. I doubt that a complete 
generalization is possible, so I think we should close this JIRA and file more 
concrete ones for the generalizations that are possible (e.g. unify the 
DynamicDestinations APIs between files and bigquery, come up with best 
practices for data-dependent IOs, etc.).

> Data-dependent sinks
> --------------------
>
>                 Key: BEAM-92
>                 URL: https://issues.apache.org/jira/browse/BEAM-92
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core
>            Reporter: Eugene Kirpichov
>            Assignee: Reuven Lax
>
> Current sink API writes all data to a single destination, but there are many 
> use cases where different pieces of data need to be routed to different 
> destinations where the set of destinations is data-dependent (so can't be 
> implemented with a Partition transform).
> One internally discussed proposal was an API of the form:
> {code}
> PCollection<Void> PCollection<T>.apply(
>     Write.using(DoFn<T, SinkT> where,
>                 MapFn<SinkT, WriteOperation<WriteResultT, T>> how)
> {code}
> so an item T gets written to a destination (or multiple destinations) 
> determined by "where"; and the writing strategy is determined by "how" that 
> produces a WriteOperation (current API - global init/write/global finalize 
> hooks) for any given destination.
> This API also has other benefits:
> * allows the SinkT to be computed dynamically (in "where"), rather than 
> specified at pipeline construction time
> * removes the necessity for a Sink class entirely
> * is sequenceable w.r.t. downstream transforms (you can stick transforms onto 
> the returned PCollection<Void>, while the current Write.to() returns a PDone)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to