This is a puzzling suggestion to me. It's unclear what features the OP
needs, so it's hard to say whether Livy or job-server would fall short.
It's true that neither is particularly mature, but both are far more
mature than a homemade project that hasn't been started yet.

That said, I'm not very familiar with either project, so perhaps there are
some big concerns I'm not aware of.

--
Michael Mior
mm...@apache.org

2017-06-21 3:19 GMT-04:00 Rick Moritz <rah...@gmail.com>:

> Keeping everything inside the same program/SparkContext is the most
> performant solution, since you can avoid serialization and
> deserialization. In-memory persistence between separate jobs involves a
> memory copy, uses a lot of RAM, and still requires serialization and
> deserialization. Technologies that make this easy are Ignite (as
> mentioned), but also Alluxio, Cassandra with in-memory tables, and a
> memory-backed HDFS directory (see tiered storage).
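>
> As a minimal sketch of the single-context approach (the paths and the
> column name below are placeholders, not from the original question):
>
>   import org.apache.spark.sql.SparkSession
>
>   val spark = SparkSession.builder().appName("shared-context").getOrCreate()
>
>   // "Job A" and "job B" each produce a result and keep it cached in
>   // executor memory, with no write to external storage in between.
>   val resultA = spark.read.parquet("/data/setA").filter("value > 0").cache()
>   val resultB = spark.read.parquet("/data/setB").filter("value > 0").cache()
>
>   // "Job C" combines the two cached results in the same context.
>   val combined = resultA.union(resultB)
>   println(combined.count())
>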
> Although Livy and job-server can expose a single SparkContext to multiple
> programs, I would recommend you build your own framework for integrating
> the different jobs, since many features you may need aren't present yet,
> while others may cause issues due to their lack of maturity. Artificially
> splitting jobs is in general a bad idea, since it breaks the DAG and thus
> prevents some potential push-down optimizations.
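>
> To make the push-down point concrete, a hypothetical example (the paths
> and the "key" column are invented):
>
>   val a = spark.read.parquet("/data/setA")
>   val b = spark.read.parquet("/data/setB")
>
>   // One DAG: Catalyst can push this filter below the join and into both
>   // Parquet scans, so far less data is read in the first place.
>   val combined = a.join(b, Seq("key")).filter("key > 100")
>
>   // If A and B instead write their results out and a separate program C
>   // reads them back, the optimizer never sees the whole plan, and this
>   // cross-job push-down is lost.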
>
> On Tue, Jun 20, 2017 at 10:17 PM, Jean Georges Perrin <j...@jgp.net> wrote:
>
>> Thanks Vadim & Jörn... I will look into those.
>>
>> jg
>>
>> On Jun 20, 2017, at 2:12 PM, Vadim Semenov <vadim.seme...@datadoghq.com>
>> wrote:
>>
>> You can launch one permanent Spark context and then execute your jobs
>> within that context. Since they will all be running in the same context,
>> they can share data easily.
>>
>> These two projects provide the functionality that you need:
>> https://github.com/spark-jobserver/spark-jobserver#persistent-context-mode---faster--required-for-related-jobs
>> https://github.com/cloudera/livy#post-sessions
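>>
>> For example, with job-server's persistent-context mode, two jobs
>> submitted to the same context can pass an RDD to each other by name.
>> A rough sketch (API names recalled from the job-server README, so
>> double-check them there; the input path is a placeholder):
>>
>>   import com.typesafe.config.Config
>>   import org.apache.spark.SparkContext
>>   import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}
>>
>>   // "Job A": caches its result under a name in the shared context.
>>   object ProducerJob extends SparkJob with NamedRddSupport {
>>     def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
>>     def runJob(sc: SparkContext, config: Config): Any = {
>>       val resultA = sc.textFile("/data/setA").map(_.length)
>>       namedRdds.update("resultA", resultA)
>>       "resultA cached"
>>     }
>>   }
>>
>>   // "Job C": submitted later to the same context, picks the RDD up by name.
>>   object ConsumerJob extends SparkJob with NamedRddSupport {
>>     def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
>>     def runJob(sc: SparkContext, config: Config): Any =
>>       namedRdds.get[Int]("resultA").map(_.sum()).getOrElse(0.0)
>>   }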
>>
>> On Tue, Jun 20, 2017 at 1:46 PM, Jean Georges Perrin <j...@jgp.net> wrote:
>>
>>> Hey,
>>>
>>> Here is my need: program A does something on a set of data and produces
>>> results, program B does the same on another set, and finally, program C
>>> combines the data from A and B. Of course, the easy way is to dump
>>> everything to disk after A and B are done, but I want to avoid that.
>>>
>>> I was thinking of creating a temp view, but I do not really like the
>>> temp aspect of it ;). Any ideas? (They are all worth sharing.)
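>>>
>>> Roughly what I had in mind (dfA, dfB, and the key column are just
>>> placeholders):
>>>
>>>   dfA.createOrReplaceTempView("resultA")
>>>   dfB.createOrReplaceTempView("resultB")
>>>   val combined = spark.sql(
>>>     "SELECT * FROM resultA a JOIN resultB b ON a.key = b.key")
>>>
>>> but those views only live as long as the session, hence my hesitation.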
>>>
>>> jg
>>>
>>>
>>
>>
>
