Is the Manager a Python multiprocessing Manager? Why are you using parallelism in Python when, in theory, most of the heavy lifting is done by Spark?
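For the use case quoted below, one possible workaround is to keep the real object in a driver-local registry and put only a picklable key into the shared dict. This is a minimal sketch, not a Spark API: `FakeDataFrame`, `put`, `get`, `is_picklable`, and `_registry` are all hypothetical names, and the stand-in class only imitates an object that pickle rejects. It assumes every consumer runs in the driver process (for example as threads), since the registry itself is not shared across processes.

```python
import pickle
import threading
from multiprocessing import Manager


class FakeDataFrame:
    """Hypothetical stand-in for an object that holds an unpicklable handle."""

    def __init__(self, name):
        self.name = name
        self._handle = threading.Lock()  # lock objects cannot be pickled


# Driver-local registry: the real objects never leave this process.
_registry = {}


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


def put(cache, key, obj):
    """Keep obj locally and store only its key in the (possibly shared) dict."""
    _registry[key] = obj
    cache[key] = key  # a plain string pickles without trouble


def get(cache, key):
    """Resolve the key stored in the dict back to the local object."""
    return _registry[cache[key]]


if __name__ == "__main__":
    df = FakeDataFrame("data")
    print("picklable:", is_picklable(df))
    with Manager() as manager:
        cache = manager.dict()  # values stored here go through pickle
        put(cache, "id", df)
        print("same object:", get(cache, "id") is df)
```

With a real DataFrame, a similar idea would be to register it under a name with `registerTempTable` and store just that name, looking it up again with `sqlContext.table(name)`; that is a suggestion under the same single-driver assumption, not something confirmed in this thread.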
On Wed, Oct 28, 2015 at 4:27 PM agg212 <a...@cs.brown.edu> wrote:

> I would just like to be able to put a Spark DataFrame in a manager.dict() and
> be able to get it out (manager.dict() calls pickle on the object being
> stored). Ideally, I would just like to store a pointer to the DataFrame
> object so that it remains distributed within Spark (i.e., not materialize
> and then store). Here is an example:
>
> data = sparkContext.jsonFile(data_file) #load file
> cache = Manager.dict() #thread-safe container
> cache['id'] = data #store reference to data, not materialized result
> new_data = cache['id'] #get reference to distributed spark dataframe
> new_data.show()
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Pickle-Spark-DataFrame-tp14803p14825.html