This is a puzzling suggestion to me. It's unclear what features the OP needs, so it's really hard to say whether Livy or spark-jobserver would be insufficient. It's true that neither is particularly mature, but they're much more mature than a homemade project which hasn't started yet.
That said, I'm not very familiar with either project, so perhaps there are some big concerns I'm not aware of.

--
Michael Mior
mm...@apache.org

2017-06-21 3:19 GMT-04:00 Rick Moritz <rah...@gmail.com>:

> Keeping it inside the same program/SparkContext is the most performant
> solution, since you can avoid serialization and deserialization.
> In-memory persistence between jobs involves a memcopy, uses a lot of RAM,
> and invokes serialization and deserialization. Technologies that can help
> you do that easily are Ignite (as mentioned), but also Alluxio, Cassandra
> with in-memory tables, and a memory-backed HDFS directory (see tiered
> storage).
>
> Although Livy and job-server provide the functionality of sharing a
> single SparkContext among multiple programs, I would recommend you build
> your own framework for integrating different jobs, since many features
> you may need aren't present yet, while others may cause issues due to
> lack of maturity. Artificially splitting jobs is in general a bad idea,
> since it breaks the DAG and thus prevents some potential push-down
> optimizations.
>
> On Tue, Jun 20, 2017 at 10:17 PM, Jean Georges Perrin <j...@jgp.net> wrote:
>
>> Thanks Vadim & Jörn... I will look into those.
>>
>> jg
>>
>> On Jun 20, 2017, at 2:12 PM, Vadim Semenov <vadim.seme...@datadoghq.com>
>> wrote:
>>
>> You can launch one permanent Spark context and then execute your jobs
>> within that context. And since they'll be running in the same context,
>> they can share data easily.
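Rick's point about staying inside a single SparkContext can be sketched as follows. This is a minimal PySpark sketch, not anyone's actual setup: it assumes `pyspark` is installed with a local master, and the function and column names are made up for illustration. The idea is that "programs" A, B, and C become functions over one shared SparkSession, so C can combine A's and B's results without a disk round-trip.

```python
# Sketch: run "program" A, B, and C against one shared SparkSession so that
# C can join A's and B's cached results directly, with no intermediate dump
# to disk. All names and data here are hypothetical.
from pyspark.sql import SparkSession, DataFrame

spark = (SparkSession.builder
         .master("local[*]")
         .appName("shared-context-sketch")
         .getOrCreate())

def program_a(spark: SparkSession) -> DataFrame:
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "left_val"])
    return df.cache()  # keep A's result in executor memory for reuse

def program_b(spark: SparkSession) -> DataFrame:
    df = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "right_val"])
    return df.cache()

def program_c(a: DataFrame, b: DataFrame) -> DataFrame:
    # Because everything shares one context, the join is planned over the
    # cached DataFrames; nothing is serialized out between the steps.
    return a.join(b, "id")

result = program_c(program_a(spark), program_b(spark))
result.show()
spark.stop()
```

This also illustrates Rick's DAG point: because C's join is part of the same lineage as A and B, Spark can still apply optimizations across the whole plan, which splitting into separate processes would prevent.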
>>
>> These two projects provide the functionality that you need:
>> https://github.com/spark-jobserver/spark-jobserver#persistent-context-mode---faster--required-for-related-jobs
>> https://github.com/cloudera/livy#post-sessions
>>
>> On Tue, Jun 20, 2017 at 1:46 PM, Jean Georges Perrin <j...@jgp.net> wrote:
>>
>>> Hey,
>>>
>>> Here is my need: program A does something on a set of data and produces
>>> results, program B does that on another set, and finally, program C
>>> combines the data of A and B. Of course, the easy way is to dump
>>> everything to disk after A and B are done, but I wanted to avoid that.
>>>
>>> I was thinking of creating a temp view, but I don't really like the
>>> temp aspect of it ;). Any ideas? (They are all worth sharing.)
>>>
>>> jg
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
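For the Livy route Vadim links to, the persistent-session workflow is two REST calls: `POST /sessions` to create a long-lived session (whose SparkContext outlives any single statement), then `POST /sessions/{id}/statements` to run code inside it. A small Python sketch, assuming a hypothetical Livy host URL; the helpers only build the JSON bodies, and the actual HTTP calls are shown commented out so the sketch stands alone:

```python
# Sketch of Livy's persistent-session REST workflow
# (see https://github.com/cloudera/livy#post-sessions).
# The host below is a placeholder, not a real endpoint.
import json

LIVY_URL = "http://livy-host:8998"  # hypothetical Livy server

def new_session_payload(kind="spark"):
    # Body for POST {LIVY_URL}/sessions -- creates a long-lived interactive
    # session; valid kinds include "spark", "pyspark", and "sparkr".
    return {"kind": kind}

def statement_payload(code):
    # Body for POST {LIVY_URL}/sessions/{id}/statements -- runs code in that
    # session; later statements can reuse what earlier ones cached.
    return {"code": code}

# With the `requests` library installed, the calls would look like:
#   r = requests.post(f"{LIVY_URL}/sessions",
#                     data=json.dumps(new_session_payload()),
#                     headers={"Content-Type": "application/json"})
#   session_id = r.json()["id"]
#   requests.post(f"{LIVY_URL}/sessions/{session_id}/statements",
#                 data=json.dumps(statement_payload("df.cache(); df.count()")))

print(json.dumps(new_session_payload()))   # {"kind": "spark"}
print(json.dumps(statement_payload("1 + 1")))  # {"code": "1 + 1"}
```

Because statements in one session share a SparkContext, this is one way to get the A/B/C pattern from the original question without dumping intermediate results to disk.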