Any suggestions on this one?

Regards
Sumit Chawla


On Tue, Dec 13, 2016 at 8:31 AM, Chawla,Sumit <sumitkcha...@gmail.com>
wrote:

> Hi All
>
> I have a workflow with different steps in my program. Let's say these are
> steps A, B, C, and D.  Step B produces some temp files on each executor node.
> How can I add another step E which consumes these files?
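>
> Roughly, step B does something like this today (a simplified sketch; the
> names and types are made up):
>
>     import java.io.{File, PrintWriter}
>     import org.apache.spark.rdd.RDD
>
>     // Simplified sketch of step B: each partition writes its records to a
>     // local temp file on whichever executor it happens to run on.
>     def stepB(input: RDD[String]): Unit = {
>       input.foreachPartition { records =>
>         val tmp = File.createTempFile("stepB-", ".dat")
>         val out = new PrintWriter(tmp)
>         try records.foreach(r => out.println(r)) finally out.close()
>         // tmp sits on the executor's local disk; only that node can read it
>       }
>     }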
>
> I understand the easiest choice is to copy all these temp files to some
> shared location, and then step E can create another RDD from them and work on
> that.  But I am trying to avoid this copy.  I was wondering if there is any
> way I can queue up these files for E as they are generated on the
> executors.  Is there any possibility of creating a dummy RDD at the start of
> the program and then pushing these files into this RDD from each executor?
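>
> For reference, the fallback I want to avoid would look roughly like this
> (a sketch only; the HDFS paths are made up):
>
>     import org.apache.spark.SparkContext
>
>     // Sketch of the fallback: step B's temp files get copied to shared
>     // storage (HDFS here), and step E builds a fresh RDD from them.
>     def stepE(sc: SparkContext): Unit = {
>       val shared = "hdfs:///tmp/stepB-output/*"      // hypothetical location
>       val fromB = sc.textFile(shared)                // the extra copy + re-read
>       fromB.map(_.toUpperCase)                       // whatever step E really does
>         .saveAsTextFile("hdfs:///tmp/stepE-output")
>     }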
>
> I take my inspiration from the concept of Side Outputs in Google Dataflow:
>
> https://cloud.google.com/dataflow/model/par-do#emitting-to-side-outputs-in-your-dofn
>
>
>
> Regards
> Sumit Chawla
>
>
