Can you give the full code, especially myfunc?

On Wed, Apr 22, 2015 at 2:20 PM, Vadim Bichutskiy <vadim.bichuts...@gmail.com> wrote:
> Here's what I did:
>
>     print 'BROADCASTING...'
>     broadcastVar = sc.broadcast(mylist)
>     print broadcastVar
>     print broadcastVar.value
>     print 'FINISHED BROADCASTING...'
>
> The above works fine, but when I call myrdd.map(myfunc) I get *NameError:
> global name 'broadcastVar' is not defined*.
>
> The myfunc function is in a different module. How do I make it aware of
> broadcastVar?
>
> On Wed, Apr 22, 2015 at 2:13 PM, Vadim Bichutskiy <vadim.bichuts...@gmail.com> wrote:
>
>> Great. Will try to modify the code. Always room to optimize!
>>
>> On Wed, Apr 22, 2015 at 2:11 PM, Tathagata Das <t...@databricks.com> wrote:
>>
>>> Absolutely. The same code would work for local as well as distributed mode!
>>>
>>> On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy <vadim.bichuts...@gmail.com> wrote:
>>>
>>>> Can I use broadcast vars in local mode?
>>>>
>>>> On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das <t...@databricks.com> wrote:
>>>>
>>>>> Yep. Not efficient. Pretty bad, actually. That's why broadcast variables
>>>>> were introduced right at the very beginning of Spark.
>>>>>
>>>>> On Wed, Apr 22, 2015 at 10:58 AM, Vadim Bichutskiy <vadim.bichuts...@gmail.com> wrote:
>>>>>
>>>>>> Thanks TD. I was looking into broadcast variables.
>>>>>>
>>>>>> Right now I am running it locally... and I plan to move it to
>>>>>> "production" on EC2.
>>>>>>
>>>>>> The way I fixed it is by doing myrdd.map(lambda x: (x, mylist)).map(myfunc),
>>>>>> but I don't think it's efficient?
>>>>>>
>>>>>> mylist is filled only once at the start and never changes.
>>>>>>
>>>>>> Vadim
>>>>>>
>>>>>> On Wed, Apr 22, 2015 at 1:42 PM, Tathagata Das <t...@databricks.com> wrote:
>>>>>>
>>>>>>> Is mylist present on every executor? If not, then you have to pass it
>>>>>>> on, and broadcasts are the best way to pass them on. But note that once
>>>>>>> broadcast, it will be immutable at the executors, and if you update the
>>>>>>> list at the driver, you will have to broadcast it again.
>>>>>>>
>>>>>>> TD
>>>>>>>
>>>>>>> On Wed, Apr 22, 2015 at 9:28 AM, Vadim Bichutskiy <vadim.bichuts...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I am using Spark Streaming with Python. For each RDD, I call a map,
>>>>>>>> i.e., myrdd.map(myfunc); myfunc is in a separate Python module. In yet
>>>>>>>> another separate Python module I have a global list, i.e. mylist,
>>>>>>>> that's populated with metadata. I can't get myfunc to see mylist... it's
>>>>>>>> always empty. Alternatively, I guess I could pass mylist to map.
>>>>>>>>
>>>>>>>> Any suggestions?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Vadim
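[Editor's note: a minimal sketch of the fix the thread converges on. The `NameError` arises because `broadcastVar` is a global of the driver-side module, which the module defining `myfunc` never sees; the usual remedy is to pass the broadcast variable into `myfunc` explicitly via the closure of the mapped lambda. `FakeBroadcast` below is a hypothetical stand-in for `pyspark.Broadcast` so the snippet runs without a cluster; in real code you would use `sc.broadcast(mylist)` and `myrdd.map(...)`.]

```python
class FakeBroadcast(object):
    """Hypothetical stand-in for pyspark.Broadcast: a read-only wrapper
    exposing the broadcast data via .value, just like the real class."""
    def __init__(self, value):
        self.value = value

# --- the "separate Python module" holding myfunc ---------------------
def myfunc(x, mylist):
    # myfunc takes the broadcast data as a parameter instead of reading
    # a module-level global, so it works regardless of where it's defined
    return (x, mylist[0])

# --- driver side -----------------------------------------------------
mylist = ['meta1', 'meta2']           # filled once at start, never changes
broadcastVar = FakeBroadcast(mylist)  # real code: broadcastVar = sc.broadcast(mylist)

# real code: result = myrdd.map(lambda x: myfunc(x, broadcastVar.value))
# The lambda captures the Broadcast handle; .value is only dereferenced
# when the lambda runs on the executor, so the list itself is not shipped
# inside the closure.
result = [myfunc(x, broadcastVar.value) for x in [1, 2, 3]]
print(result)  # [(1, 'meta1'), (2, 'meta1'), (3, 'meta1')]
```

In real PySpark, capturing `broadcastVar` (rather than evaluating `broadcastVar.value` on the driver) is what keeps the data out of the serialized closure and lets executors read it from the broadcast cache.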