This is a really good question, Sergio. You went straight to the crux of the problem: how to express such a pattern in the Beam model.
The answer depends on whether the data is static, i.e., known at pipeline construction time or computed in the earlier stages of the pipeline, or instead evolving during pipeline execution. I'll give a high-level answer; feel free to share more about your use case and we can drill into the specifics.

In the simplest case, where the data is known a priori, Beam supports a "files to stage" concept. Runners distribute those files to all workers before computation starts, and your logic can depend on the data being available locally on each worker.

If that is not sufficient, Beam's side inputs are the right primitive. We support several access patterns for side inputs, including distributed lookup and various types of caching, and this can work really well, particularly with a well-optimized runner. (I'll append a rough sketch after the quoted thread below.)

The other alternatives typically involve access to shared storage, which is a lower-level approach and often requires more work.

Back to Ismaël's question: Beam is great at orchestrating such pipelines. You can build a pipeline that prepares data for a custom system, manages its invocation, and processes its output. PTransforms can encapsulate arbitrary computation, including invocation of outside logic or systems, and it would be great to have a set of PTransform libraries that wrap such computations. (See the second sketch below.)

On Wed, Jun 15, 2016 at 2:45 AM, Jean-Baptiste Onofré <[email protected]> wrote:

> I would say DSL + PTransform should work.
>
> But certainly some PoC to do ;)
>
> Regards
> JB
>
> On 06/15/2016 11:39 AM, Ismaël Mejía wrote:
>
>> One interesting point that Sergio mentions, and that is getting lost in
>> the discussion, is how to integrate other dataflow-style frameworks into
>> Beam, e.g. TensorFlow. I am really curious what the others have to say
>> about this, since it is probably a question that will come up once more
>> users write pipelines on Beam. Any ideas on this? Or is the solution just
>> to write some 'integration PTransforms' and that's it?
>>
>> Regards,
>> Ismaël
>>
>> ps. I forgot to say hi and welcome, Sergio :).
>>
>> On Wed, Jun 15, 2016 at 11:18 AM, Jean-Baptiste Onofré <[email protected]>
>> wrote:
>>
>>> Not the Beam Model for sure (the Beam Model is about the pipeline
>>> design).
>>>
>>> The Beam Runner API can help there, but the final implementation is up
>>> to the runner itself.
>>>
>>> Regards
>>> JB
>>>
>>> On 06/15/2016 10:18 AM, Sergio Fernández wrote:
>>>
>>>> Hi Jean-Baptiste,
>>>>
>>>> On Tue, Jun 14, 2016 at 12:45 PM, Jean-Baptiste Onofré <[email protected]>
>>>> wrote:
>>>>
>>>>> Welcome aboard, and good to discuss with you during ApacheCon.
>>>>
>>>> It was nice to put faces to you all ;-)
>>>>
>>>>> Distribution of the resources is a point related to the runner, and
>>>>> more specifically to the execution environment of the runner. Each
>>>>> runner/backend will implement their own logic.
>>>>
>>>> Yes, I can understand. But I wonder if the Beam Model provides any
>>>> primitive to deal with such aspects in an abstract way. I guess I'd
>>>> need to go deeper into Beam to approach you with more concrete
>>>> questions, so for now it's fine.
>>>>
>>>>> Regarding the Python SDK, we discussed that last week: it's on the
>>>>> way. We should have the Python SDK very soon (we were busy with the
>>>>> first release).
>>>>
>>>> Yep, I knew that was the plan. It would be really cool to have it in
>>>> master already for the next release :-)
>>>>
>>>> Thanks.
>>>>>
>>>>> On 06/14/2016 12:38 PM, Sergio Fernández wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I'm a newbie in the Beam community, but as someone who has used
>>>>>> DataFlow in the past I've been following the podling since you came
>>>>>> to the ASF. I'm very happy to see that 0.1.0-incubating is finally
>>>>>> going out; congratulations on such a great milestone.
>>>>>>
>>>>>> I talked with some of you guys at the last ApacheCon, and it was good
>>>>>> to know the Python SDK was just a matter of time and should come to
>>>>>> Beam at some point. So, coming back to the original plans
>>>>>> <http://beam.incubator.apache.org/beam/python/sdk/2016/02/25/python-sdk-now-public.html>,
>>>>>> do you have any timeline for bringing the Python SDK to Beam?
>>>>>>
>>>>>> I'd also like to raise the question of how Beam plans to deal with
>>>>>> the distribution of resources across all nodes, something I know is
>>>>>> not really clean with some runners (e.g., Spark). More concretely,
>>>>>> we're using Keras <http://keras.io/>, a deep learning Python library
>>>>>> that is capable of running on top of either TensorFlow or Theano.
>>>>>> Historically I know DataFlow and TensorFlow are not very compatible,
>>>>>> but I wonder if the project has already discussed how to support
>>>>>> running Keras (TensorFlow) tasks on Beam. For us it is more for
>>>>>> querying than for training, so I'd like to know if the Beam Model
>>>>>> could natively support the distribution of the models (sometimes
>>>>>> several GB).
>>>>>>
>>>>>> Thanks in advance.
>>>>>>
>>>>>> Cheers,
>>>>>
>>>>> --
>>>>> Jean-Baptiste Onofré
>>>>> [email protected]
>>>>> http://blog.nanthrax.net
>>>>> Talend - http://www.talend.com
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> [email protected]
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
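P.S. Here is the rough sketch of the side-input pattern I promised, against the Java SDK. It is a minimal, self-contained toy: the class name, the parameter keys, and the scoring formula are all placeholders, and in a real pipeline the "model" would be read or computed in an earlier stage rather than hard-coded with Create.

import java.util.Map;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollectionView;

public class SideInputSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Model parameters, materialized as a map-shaped side input. The runner
    // decides how to distribute and cache it on each worker.
    PCollectionView<Map<String, Double>> modelView =
        p.apply("Params", Create.of(KV.of("bias", 0.5), KV.of("weight", 2.0)))
         .apply(View.<String, Double>asMap());

    p.apply("Inputs", Create.of(1.0, 2.0, 3.0))
     .apply("Score", ParDo.of(new DoFn<Double, Double>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          // Side-input lookups are local to the worker.
          Map<String, Double> model = c.sideInput(modelView);
          c.output(model.get("bias") + model.get("weight") * c.element());
        }
      }).withSideInputs(modelView));

    p.run();
  }
}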

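And a sketch of the orchestration idea: a composite PTransform that hides the call-out to an external system. ExternalModelClient is a hypothetical stand-in for whatever client your system actually provides (a TensorFlow process, a REST service, a subprocess, ...); the stub here just "scores" by string length so the example stands alone, and method names may differ slightly across SDK versions.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class ScoreWithExternalModel
    extends PTransform<PCollection<String>, PCollection<Double>> {

  private final String modelUri;  // where the wrapped system finds the model

  public ScoreWithExternalModel(String modelUri) {
    this.modelUri = modelUri;
  }

  @Override
  public PCollection<Double> expand(PCollection<String> input) {
    return input.apply(ParDo.of(new DoFn<String, Double>() {
      private transient ExternalModelClient client;

      @Setup
      public void setup() {
        // One-time, per-worker initialization of the external system;
        // reused across bundles instead of reconnecting per element.
        client = ExternalModelClient.connect(modelUri);
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(client.score(c.element()));
      }

      @Teardown
      public void teardown() {
        if (client != null) {
          client.close();
        }
      }
    }));
  }

  // Hypothetical client; in reality this would wrap your external system.
  // The stub keeps the sketch self-contained.
  static class ExternalModelClient {
    static ExternalModelClient connect(String uri) { return new ExternalModelClient(); }
    double score(String input) { return input.length(); }
    void close() {}
  }
}

A pipeline would then simply do input.apply(new ScoreWithExternalModel("/path/to/model")) and stay oblivious to how the scoring actually happens.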