On Fri, Nov 25, 2016 at 2:36 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> By the way, you can also use TensorFrame, which allows you to use TensorFlow
> directly with Spark DataFrames, with more direct access. I discussed that
> with Tim Hunter from Databricks, who's working on TensorFrame.
Yes, we have been discussing and experimenting a bit with TensorFrame. The
work is very interesting, although it has some limitations. It would also
mean taking a step back in our plan of getting away from the specifics of
the concrete processing engine.
> Back on Beam, what you could do:
> 1. you expose the service on a microservice container (for instance Apache
> Karaf ;))
> In your pipeline, you have two options:
> 2.a. in your Beam pipeline, in a DoFn, in the @Setup you can create the
> REST client (using CXF, or whatever), and in the @ProcessElement you can
> use the service (hosted by Karaf)
Apart from using a different microservice infrastructure, that matches what
I had in mind: I already started to play with DoFn and the concepts around
it.
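For the record, this is roughly the shape of option 2.a as I understand it, as
a plain-Python sketch (no Beam dependency): the class mirrors the Beam DoFn
lifecycle, with setup/process/teardown standing in for @Setup, @ProcessElement
and @Teardown, and a stub client standing in for a real CXF/HTTP client. All
names here are illustrative, not real API:

```python
# Sketch of option 2.a: create the REST client once per worker (as in
# Beam's @Setup), call the service once per element (as in @ProcessElement).
# Plain Python, no Beam dependency; the client is a stub.

class StubRestClient:
    """Stands in for a real HTTP client (CXF, requests, ...)."""
    def __init__(self, endpoint):
        self.endpoint = endpoint

    def post(self, payload):
        # A real client would POST the payload to the classifier
        # microservice here and return its response.
        return {"input": payload, "label": "stub-label"}

class ClassifyFn:
    """Mirrors the Beam DoFn lifecycle: setup once, process per element."""
    def __init__(self, endpoint):
        self.endpoint = endpoint
        self.client = None

    def setup(self):                     # Beam: @Setup
        self.client = StubRestClient(self.endpoint)

    def process(self, element):          # Beam: @ProcessElement
        yield self.client.post(element)

    def teardown(self):                  # Beam: @Teardown
        self.client = None

fn = ClassifyFn("http://karaf-host:8181/classify")  # endpoint is illustrative
fn.setup()
results = [r for e in ["doc-1", "doc-2"] for r in fn.process(e)]
fn.teardown()
```

In Beam proper the runner calls the lifecycle methods, of course; the point is
only that the client is built once per worker, not once per element.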
> 2.b. I also have a RestIO (source and sink) that can request a REST
> endpoint. However, for now, this IO acts as a pipeline endpoint
> (PTransform<PBegin, PCollection> or PTransform<PCollection, PDone>). In
> your case, if the service called is a step of your pipeline, ParDo(your
> DoFn) would be easier.
Yes, that's what I understood from the Beam design: an IO is expected at
the head or the tail of the pipeline.
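Just to make sure I got the shape right, here is how I picture it, again as a
plain-Python sketch with illustrative names (the functions stand in for the
head IO, the middle ParDo step, and the tail IO; the label value is a stub):

```python
# Sketch of the pipeline shape discussed: IOs at the head and tail,
# classification as a plain middle step (what ParDo(your DoFn) would do).

def read_source():
    # head: plays the role of a PTransform<PBegin, PCollection>
    return ["doc-1", "doc-2", "doc-3"]

def classify(element):
    # middle: the ParDo step that would call the classifier service
    return {"doc": element, "label": "stub-label"}

def write_sink(records):
    # tail: plays the role of a PTransform<PCollection, PDone>
    return len(records)  # e.g. number of records written

written = write_sink([classify(e) for e in read_source()])
```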
> Is it what you mean by microservice ?
Yep, exactly that.
Thanks so much!
On 11/25/2016 01:18 PM, Sergio Fernández wrote:
> Hi JB,
> On Tue, Nov 22, 2016 at 11:14 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> DoFn will execute per element (optionally with hooks on StartBundle,
>> FinishBundle, and Teardown). It's basically the way it works in the IO
>> WriteFn: we create the connection in StartBundle and send each element
>> (within a batch) to the external resource.
>> PTransform is maybe more flexible in the case of interacting with the
>> "outside".
> Probably PTransform would be a better place. I'm still pretty new to some
> of the Beam terms and APIs.
> Do you have a use case, so I can be sure I understand?
> Well, it's far more complex, but for this question I can simplify it:
> We have a TensorFlow-based classifier. In our pipeline, one step performs
> that classification of the data. Currently it's implemented as a Spark
> Function, because TensorFlow models can be embedded directly within
> pipelines using PySpark.
> Therefore I'm looking for the best option to move such a classification
> process one level up in the abstraction with Beam, so I could make it
> portable. The first idea I'm exploring is relying on an external function
> (i.e., a microservice) that I'd need to scale up and down independently of
> the pipeline. So I'm more than welcome to discuss ideas ;-)
> On 11/22/2016 10:39 AM, Sergio Fernández wrote:
>>> I'd like to resume the idea of having TensorFlow-based tasks running in a
>>> Beam pipeline. So far the cleanest approach I can imagine would be to have
>>> it running outside (Functions in GCP, Lambdas in AWS, microservices, ...).
>>> Therefore, does the current Beam model provide the notion of a DoFn which
>>> actually runs externally?
>>> Thanks in advance for the feedback.
>> Jean-Baptiste Onofré
>> Talend - http://www.talend.com
>> Sergio Fernández
>> Partner Technology Manager
>> Redlink GmbH
>> m: +43 6602747925
>> e: sergio.fernan...@redlink.co
>> w: http://redlink.co