We are in the process of porting the Cloud Dataflow documentation to Beam, so I'll give you a mix of Dataflow and Beam links.
FilesToStage is a pipeline option [1], [2]. Super-easy to use. Side inputs
are a ParDo concept [3]. (Rough sketches of both are at the bottom of this
message.)

If you hit any rough edges, please let us know -- I'd be glad to help!

[1] https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options
[2] https://beam.incubator.apache.org/javadoc/0.1.0-incubating/org/apache/beam/runners/dataflow/options/DataflowPipelineWorkerPoolOptions.html#getFilesToStage--
[3] https://cloud.google.com/dataflow/model/par-do#side-inputs

On Thu, Jun 16, 2016 at 1:40 AM, Sergio Fernández <[email protected]> wrote:

> Hi Davor,
>
> On Thu, Jun 16, 2016 at 3:04 AM, Davor Bonaci <[email protected]> wrote:
>
> > This is a really good question, Sergio. You got right away to the crux
> > of the problem -- how to express such a pattern in the Beam model.
> >
> > The answer depends on whether the data is static, e.g., whether it is
> > known at pipeline construction time / computed in the earlier stages of
> > the pipeline, or perhaps evolving during pipeline execution. I'll give a
> > high-level answer -- feel free to share more information about your use
> > case and we can drill into specific details.
>
> Well, as I said, for us it is more interesting to use Beam at processing
> time than for training purposes. In the past we have experimented a bit
> with approaches like TensorSpark <https://github.com/adatao/tensorspark>,
> but the critical aspect is exploitation of the models. Therefore we could
> assume the models are static data.
>
> > In the simplest case, Beam supports a "files to stage" concept if the
> > data is known a priori. In this case, runners will distribute the data
> > to all workers before computation starts, and your logic can depend on
> > the data being available locally on each worker.
>
> Oh, cool. Something like that would be more than enough for now. Can you
> please point me to any documentation or code I could use to play with it?
>
> > If this is not sufficient, Beam's side inputs are the right primitive.
> > We support several access patterns for side inputs, including
> > distributed lookup and various types of caching. This can work really
> > well, particularly with a well-optimized runner.
>
> Interesting... any (early) documentation (or code) about such a feature?
>
> > Other alternatives typically include access to shared storage, which is
> > a lower-level approach and often requires more work.
>
> Sure, shared storage is always an option, but for many reasons I'd rather
> not resort to such an approach.
>
> Thanks so much for all the ideas and valuable discussions!
>
> Cheers,
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925
> e: [email protected]
> w: http://redlink.co
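To make "files to stage" concrete, here is a rough, untested sketch against
the 0.1.0-incubating Java API referenced in [2]. The class name and the
model file path are made up for illustration; anything appended to
filesToStage is distributed to the workers before processing starts, as
described above.

import java.util.ArrayList;
import java.util.List;

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class FilesToStageExample {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);

    // By default the runner stages the detected classpath; appending extra
    // files makes them available locally on every worker as well.
    List<String> filesToStage = new ArrayList<String>();
    if (options.getFilesToStage() != null) {
      filesToStage.addAll(options.getFilesToStage());
    }
    filesToStage.add("/local/path/to/model.bin");  // hypothetical model file
    options.setFilesToStage(filesToStage);

    Pipeline p = Pipeline.create(options);
    // ... build the rest of the pipeline; DoFns can read the staged file
    // locally on each worker.
    p.run();
  }
}

The option can also be set on the command line like any other pipeline
option; [1] describes the general mechanism.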

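And here is a similarly rough sketch of the side-input pattern from [3],
again written against the 0.1.0-incubating Java API. The "weights" map
standing in for a model, and every name in it, are hypothetical.

import java.util.Map;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class SideInputExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // A tiny stand-in for a model: feature weights computed (or loaded) in
    // an earlier stage of the pipeline.
    PCollection<KV<String, Double>> weights =
        p.apply("Weights", Create.of(KV.of("bias", 0.1), KV.of("wordCount", 0.7)));

    // Materialize the weights as a side-input view; the runner makes it
    // available to every worker, handling lookup and caching.
    final PCollectionView<Map<String, Double>> weightsView =
        weights.apply(View.<String, Double>asMap());

    PCollection<String> documents =
        p.apply("Docs", Create.of("a short doc", "another slightly longer doc"));

    // The main ParDo declares the side input and reads it per element.
    PCollection<Double> scores = documents.apply("Score",
        ParDo.withSideInputs(weightsView).of(new DoFn<String, Double>() {
          @Override
          public void processElement(ProcessContext c) {
            Map<String, Double> w = c.sideInput(weightsView);
            double score =
                w.get("bias") + w.get("wordCount") * c.element().split(" ").length;
            c.output(score);
          }
        }));

    p.run();
  }
}

The same shape works when the "model" is produced by earlier transforms
rather than Create, which is what makes side inputs the right fit for data
computed during the earlier stages of the pipeline.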