I assume your on-demand calculations are a streaming flow? If the data aggregated from your batch job isn't too large, you could just save it to disk; when your streaming flow starts, read the aggregations back from disk and perhaps broadcast them. Though I guess you'd have to restart your streaming flow whenever these aggregations are updated.
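A minimal sketch of the save-to-disk idea, not tied to any particular Spark API — it assumes the aggregates fit in a plain dict and uses a hypothetical file path; in a real Spark job you might write to HDFS instead, and broadcast the loaded dict with `sc.broadcast(aggs)`:

```python
import json
import os

AGG_PATH = "/tmp/aggregates.json"  # hypothetical location; use a path both jobs can reach

def save_aggregates(aggs, path=AGG_PATH):
    """Called at the end of the batch ETL job: persist the aggregation results."""
    with open(path, "w") as f:
        json.dump(aggs, f)

def load_aggregates(path=AGG_PATH):
    """Called when the streaming job starts: read the aggregates back.
    In Spark you could then broadcast them: sc.broadcast(aggs)."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)

# ETL side: write the results before the job terminates.
save_aggregates({"total_sales": 12345, "avg_order": 42.5})

# Streaming side: reload them at startup.
aggs = load_aggregates()
print(aggs["total_sales"])  # 12345
```

The downside, as noted above, is that the streaming job only picks up new aggregates when it restarts and reloads the file.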
For something more sophisticated, maybe look at Redis <http://redis.io/> or some distributed database. Your ETL job can update that store, and your on-demand job can query it.

On Thu, Aug 28, 2014 at 4:30 PM, huylv <huy.le...@insight-centre.org> wrote:

> Hi,
>
> I'm building a system for near real-time data analytics. My plan is to have
> an ETL batch job which calculates aggregations running periodically. User
> queries are then parsed for on-demand calculations, also in Spark. Where are
> the pre-calculated results supposed to be saved? I mean, after finishing
> aggregations, the ETL job will terminate, so caches are wiped out of memory.
> How can I use these results to calculate on-demand queries? Or more
> generally, could you please give me a good way to organize the data flow and
> jobs in order to achieve this?
>
> I'm new to Spark so sorry if this might sound like a dumb question.
>
> Thank you.
> Huy
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Where-to-save-intermediate-results-tp13062.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
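P.S. A minimal sketch of the Redis approach. Since I can't assume a live server here, this uses a tiny in-memory stand-in with the same `set`/`get` shape as redis-py's client; the key names and values are made up for illustration:

```python
import json

class FakeRedis:
    """In-memory stand-in mimicking redis-py's set/get interface.
    With a real server you'd instead do:
        import redis
        r = redis.Redis(host="localhost", port=6379)
    """
    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)

r = FakeRedis()

# ETL job: write each aggregate under a key as it finishes.
r.set("agg:daily_totals", json.dumps({"2014-08-28": 12345}))

# On-demand query job: read the latest aggregates back, no restart needed.
totals = json.loads(r.get("agg:daily_totals"))
print(totals["2014-08-28"])  # 12345
```

The advantage over the file approach is that the on-demand job always sees the latest values the ETL has written, without restarting anything.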