Re: "Sharing" dataframes...

2017-06-21 Thread Pierce Lamb
Hi Jean, Since many in this thread have mentioned datastores from what I would call the "Spark datastore ecosystem," I thought I would link you to a StackOverflow answer I posted a while back that tried to capture the majority of this ecosystem. Most would claim to allow you to do something like ...

Re: "Sharing" dataframes...

2017-06-21 Thread Gene Pang
Hi Jean, As others have mentioned, you can use Alluxio with Spark dataframes to keep the data in memory, so that other jobs can read it back from memory again. Hope this helps, Gene
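
A minimal sketch of the pattern Gene describes, in Scala. It assumes the Alluxio client jar is on the Spark classpath and an Alluxio master at the placeholder address "alluxio-master:19998"; the paths and names are illustrative only:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("writer").getOrCreate()

    // Job A: write the DataFrame to Alluxio, which serves it from its memory tier.
    val df = spark.range(0, 1000).toDF("id")
    df.write.parquet("alluxio://alluxio-master:19998/shared/df")

    // Job B (a separate Spark application) reads the same data back,
    // hitting Alluxio's in-memory copy instead of recomputing or going to disk.
    val shared = spark.read.parquet("alluxio://alluxio-master:19998/shared/df")
    shared.count()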

Re: "Sharing" dataframes...

2017-06-21 Thread Jean Georges Perrin
I have looked at Livy in the (very recent) past and it will not do the trick for me. It seems pretty greedy in terms of resources (or at least that was our experience). I will investigate how job-server could do the trick. (On a side note, I tried to find a paper on memory lifecycle within ...

Re: "Sharing" dataframes...

2017-06-21 Thread Michael Mior
This is a puzzling suggestion to me. It's unclear what features the OP needs, so it's really hard to say whether Livy or job-server are sufficient. It's true that neither is particularly mature, but both are much more mature than a homemade project that hasn't started yet. That said, I'm not ...

Re: "Sharing" dataframes...

2017-06-21 Thread Rick Moritz
Keeping it inside the same program/SparkContext is the most performant solution, since you can avoid serialization and deserialization. In-memory persistence between jobs involves a memory copy, uses a lot of RAM, and invokes serialization and deserialization. Technologies that can help you do that ...
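
To make that concrete, a sketch of the single-context approach in Scala (the input path and column names are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("one-context").getOrCreate()

    // "Program A": compute a result and cache it on the executors.
    val result = spark.read.parquet("/data/input").groupBy("key").count()
    result.cache()
    result.count() // materialize the cache

    // "Program B" is just more code in the same SparkContext, so it reads
    // the cached blocks in place: no copy through an external store and
    // no serialization across process boundaries.
    result.orderBy(result("count").desc).limit(10).show()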

Re: "Sharing" dataframes...

2017-06-20 Thread Jean Georges Perrin
Thanks Vadim & Jörn... I will look into those. jg

Re: "Sharing" dataframes...

2017-06-20 Thread Vadim Semenov
You can launch one permanent Spark context and then execute your jobs within the context. And since they'll be running in the same context, they can share data easily. These two projects provide the functionality that you need: spark-jobserver and Livy.
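
A rough sketch of the permanent-context pattern against spark-jobserver's 2017-era job API (the SparkJob and NamedRddSupport traits); exact trait and method names vary between versions, so treat this as an illustration rather than a drop-in job:

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

    // Job 1: builds an RDD and publishes it under a name in the shared context.
    object ProducerJob extends SparkJob with NamedRddSupport {
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
      override def runJob(sc: SparkContext, config: Config): Any = {
        val data = sc.parallelize(Seq(("a", 1), ("b", 2)))
        this.namedRdds.update("shared-data", data)
        "published"
      }
    }

    // Job 2: submitted later against the SAME long-lived context,
    // so it can fetch the named RDD without any re-read or copy.
    object ConsumerJob extends SparkJob with NamedRddSupport {
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
      override def runJob(sc: SparkContext, config: Config): Any = {
        val shared = this.namedRdds.get[(String, Int)]("shared-data").get
        shared.values.sum()
      }
    }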

Re: "Sharing" dataframes...

2017-06-20 Thread Jörn Franke
You could express it all in one program; alternatively, use the Ignite in-memory file system or the Ignite shared RDD (not sure if DataFrame is supported). > On 20. Jun 2017, at 19:46, Jean Georges Perrin wrote: > Hey, > Here is my need: program A does something on a set of data and ...
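
A minimal sketch of the Ignite shared-RDD route Jörn mentions, assuming the ignite-spark module is on the classpath and an Ignite cluster is reachable; the cache name is a placeholder:

    import org.apache.ignite.configuration.IgniteConfiguration
    import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("ignite-share"))
    val ic = new IgniteContext(sc, () => new IgniteConfiguration())

    // Program A writes pairs into a named Ignite cache...
    val shared: IgniteRDD[Int, String] = ic.fromCache[Int, String]("sharedCache")
    shared.savePairs(sc.parallelize(1 to 5).map(i => (i, s"value-$i")))

    // ...and program B, even with its own SparkContext in another JVM,
    // can attach to the same cache name and read the data back.
    println(shared.count())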