Any suggestions on this one?

Regards
Sumit Chawla
On Tue, Dec 13, 2016 at 8:31 AM, Chawla,Sumit <sumitkcha...@gmail.com> wrote:
> Hi All,
>
> I have a workflow with different steps in my program. Let's say these are
> steps A, B, C, D. Step B produces some temp files on each executor node.
> How can I add another step E which consumes these files?
>
> I understand the easiest choice is to copy all these temp files to a
> shared location, and then step E can create another RDD from them and
> work on that. But I am trying to avoid this copy. I was wondering if
> there is any way I can queue up these files for E as they are being
> generated on the executors. Is there any way to create a dummy RDD at the
> start of the program and then push these files into this RDD from each
> executor?
>
> I take my inspiration from the concept of Side Outputs in Google Dataflow:
>
> https://cloud.google.com/dataflow/model/par-do#emitting-to-side-outputs-in-your-dofn
>
> Regards
> Sumit Chawla
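One common way to approximate Dataflow-style side outputs in Spark, without copying temp files to shared storage, is to have step B emit its side records tagged alongside its main output, then split the tagged collection with two filters. Below is a minimal plain-Python sketch of that tagging pattern (a list stands in for an RDD; `step_b` and all record values are hypothetical, and in real Spark the same shape would be `rdd.flatMap(step_b)` followed by two `filter` calls):

```python
# Sketch: emulate "side outputs" by tagging each emitted record.
# A plain list stands in for an RDD; all names here are hypothetical.

def step_b(record):
    """Process one record; emit the main result plus a tagged side record
    (standing in for what would otherwise be a temp file on the executor)."""
    main = ("main", record * 2)
    side = ("side", "temp-%d" % record)
    return [main, side]

records = [1, 2, 3]

# flatMap equivalent: one input record can emit several tagged outputs.
tagged = [out for r in records for out in step_b(r)]

# Steps C/D consume the main output...
main_output = [value for tag, value in tagged if tag == "main"]

# ...while step E consumes the side output, with no shared filesystem copy.
side_output = [value for tag, value in tagged if tag == "side"]

print(main_output)  # [2, 4, 6]
print(side_output)  # ['temp-1', 'temp-2', 'temp-3']
```

In Spark you would typically persist/cache the tagged RDD before the two filters, so step B runs only once rather than once per downstream branch.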