I am already creating these files on the slaves. How can I create an RDD from these files on the slaves?
Regards
Sumit Chawla

On Thu, Dec 15, 2016 at 11:42 AM, Reynold Xin <r...@databricks.com> wrote:

> You can just write some files out directly (and idempotently) in your
> map/mapPartitions functions. It is just a function that you can run
> arbitrary code in, after all.
>
> On Thu, Dec 15, 2016 at 11:33 AM, Chawla,Sumit <sumitkcha...@gmail.com> wrote:
>
>> Any suggestions on this one?
>>
>> Regards
>> Sumit Chawla
>>
>> On Tue, Dec 13, 2016 at 8:31 AM, Chawla,Sumit <sumitkcha...@gmail.com> wrote:
>>
>>> Hi All
>>>
>>> I have a workflow with several steps in my program. Let's say these are
>>> steps A, B, C, D. Step B produces some temp files on each executor node.
>>> How can I add another step E which consumes these files?
>>>
>>> I understand the easiest choice is to copy all these temp files to a
>>> shared location, and then step E can create another RDD from it and work
>>> on that. But I am trying to avoid this copy. I was wondering if there is
>>> any way I can queue up these files for E as they are generated on the
>>> executors. Is there any possibility of creating a dummy RDD at the start
>>> of the program, and then pushing these files into this RDD from each
>>> executor?
>>>
>>> I take my inspiration from the concept of Side Outputs in Google
>>> Dataflow:
>>>
>>> https://cloud.google.com/dataflow/model/par-do#emitting-to-side-outputs-in-your-dofn
>>>
>>> Regards
>>> Sumit Chawla
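---

A minimal sketch of Reynold's suggestion, for the archives: step B emits its normal output downstream while also writing each partition's side output to a local file inside `mapPartitions`. The write-to-temp-then-rename pattern keeps the write idempotent under task retries (a re-run overwrites the file rather than appending or leaving a partial file visible). The path `/tmp/stepB-part-*.out` and the doubling transformation are illustrative assumptions, not from the thread.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardCopyOption}
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object SideOutputSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("side-output-sketch"))
    val data = sc.parallelize(1 to 100, 4)

    // Step B: transform the data, and additionally write this partition's
    // results to a file local to the executor that ran the task.
    val stepB = data.mapPartitions { iter =>
      val part    = TaskContext.get().partitionId()
      val results = iter.map(_ * 2).toList

      // Write to a temp file, then rename: a retried task simply
      // replaces the file, so the output stays idempotent.
      val tmp   = Paths.get(s"/tmp/stepB-part-$part.tmp")
      val fin   = Paths.get(s"/tmp/stepB-part-$part.out")
      Files.write(tmp, results.mkString("\n").getBytes(StandardCharsets.UTF_8))
      Files.move(tmp, fin, StandardCopyOption.REPLACE_EXISTING)

      results.iterator
    }

    stepB.count()  // force execution so the files exist before step E runs
    sc.stop()
  }
}
```

One caveat worth noting: a later step E can only read these files from tasks that happen to run on the same nodes, and Spark gives no general guarantee that a subsequent stage is scheduled on the same executors. That scheduling gap is the usual reason the "copy to shared storage" answer keeps coming up.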