We are in the process of porting the Cloud Dataflow documentation to Beam, so
I'll give you a mix of Dataflow and Beam links.

FilesToStage is a pipeline option [1], [2]. Super-easy to use.
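
For example, here's a rough, untested sketch against the 0.1.0-incubating API
linked in [2]. The model path is just a placeholder, and the usual Dataflow
options (--project, --stagingLocation, --runner, ...) would still come from
the command-line args:

  import java.util.Arrays;

  import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  public class StageModelFile {
    public static void main(String[] args) {
      DataflowPipelineOptions options = PipelineOptionsFactory
          .fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

      // Stage an extra local file to every worker ("/path/to/model.bin" is a
      // placeholder). Note that setting filesToStage explicitly typically
      // replaces the automatically detected classpath entries, so include
      // those as well if your code depends on them.
      options.setFilesToStage(Arrays.asList("/path/to/model.bin"));

      Pipeline p = Pipeline.create(options);
      // ... build the pipeline; DoFns can then read the staged file locally ...
      p.run();
    }
  }

The same option should also be settable directly on the command line, e.g.
--filesToStage=/path/to/model.bin (comma-separated for multiple files).
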
Side inputs are a ParDo concept [3].
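
And a minimal sketch of the side-input pattern from [3], again untested and
with a placeholder in-memory "model". View.asMap materializes the model once,
and each worker then reads it through the ProcessContext. (The DoFn/ParDo
surface may differ in later SDK versions.)

  import java.util.Map;

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.coders.DoubleCoder;
  import org.apache.beam.sdk.coders.KvCoder;
  import org.apache.beam.sdk.coders.StringUtf8Coder;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.transforms.Create;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.ParDo;
  import org.apache.beam.sdk.transforms.View;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.PCollectionView;

  public class SideInputModel {
    public static void main(String[] args) {
      Pipeline p = Pipeline.create(
          PipelineOptionsFactory.fromArgs(args).withValidation().create());

      // Placeholder "model": feature name -> weight.
      final PCollectionView<Map<String, Double>> modelView = p
          .apply(Create.of(KV.of("f1", 0.5), KV.of("f2", -0.25))
              .withCoder(KvCoder.of(StringUtf8Coder.of(), DoubleCoder.of())))
          .apply(View.<String, Double>asMap());

      PCollection<String> input = p.apply(Create.of("f1", "f2", "f3"));

      // The view is declared as a side input of the ParDo and read per element.
      input.apply(ParDo.withSideInputs(modelView).of(
          new DoFn<String, Double>() {
            @Override
            public void processElement(ProcessContext c) {
              Map<String, Double> model = c.sideInput(modelView);
              Double weight = model.get(c.element());
              c.output(weight == null ? 0.0 : weight);
            }
          }));

      p.run();
    }
  }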

If you hit any rough edges, please let us know -- I'd be glad to help!

[1]
https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options
[2]
https://beam.incubator.apache.org/javadoc/0.1.0-incubating/org/apache/beam/runners/dataflow/options/DataflowPipelineWorkerPoolOptions.html#getFilesToStage--
[3] https://cloud.google.com/dataflow/model/par-do#side-inputs

On Thu, Jun 16, 2016 at 1:40 AM, Sergio Fernández <[email protected]> wrote:

> Hi Davor,
>
> On Thu, Jun 16, 2016 at 3:04 AM, Davor Bonaci <[email protected]>
> wrote:
>
> > This is a really good question, Sergio. You got right to the crux of the
> > problem -- how to express such a pattern in the Beam model.
> >
> > The answer depends on whether the data is static, e.g., whether it is
> > known at pipeline construction time / computed in the earlier stages of
> > the pipeline, or perhaps evolving during pipeline execution. I'll give a
> > high-level answer -- feel free to share more information about your use
> > case and we can drill into specific details.
> >
>
> Well, as I said, for us it is more interesting to use Beam at processing
> time than for training purposes. In the past we have experimented a bit with
> approaches like TensorSpark <https://github.com/adatao/tensorspark>, but
> the critical aspect is the exploitation of the models. Therefore we could
> assume the models are static data.
>
>
>
> > In the simplest case, Beam supports a "files to stage" concept if the
> > data is known a priori. In this case, runners will distribute the data to
> > all workers before computation starts, and your logic can depend on the
> > data being available locally on each worker.
> >
>
> Oh, cool. Something like that would be more than enough for now. Can you
> please point me to any documentation or code I could use to play with it?
>
>
> > If this is not sufficient, Beam's side inputs are the right primitive. We
> > support several access patterns for side inputs, including distributed
> > lookup and various types of caching. This can work really well,
> > particularly with a well-optimized runner.
> >
>
> Interesting... any (early) documentation (or code) about such a feature?
>
>
>
> > Other alternatives typically include access to shared storage, which is
> > a lower-level approach and often requires more work.
>
>
> Sure, shared storage is always an option, but for many reasons I'd rather
> not resort to such an approach.
>
> Thanks so much for all the ideas and valuable discussions!
>
> Cheers,
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925
> e: [email protected]
> w: http://redlink.co
>
