I assume your on-demand calculations run as a streaming flow? If the data
aggregated by the batch job isn't too large, you could simply save it to
disk; when your streaming flow starts, read the aggregations back from
disk and perhaps broadcast them to the executors. The catch is that you'd
have to restart your streaming flow whenever these aggregations are updated.
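A minimal sketch of that pattern in plain Python (the file name and the
aggregation contents are made up; in an actual Spark job you'd use something
like rdd.saveAsPickleFile / sc.pickleFile and sc.broadcast rather than the
raw file I/O shown here):

```python
import json
import os
import tempfile

def save_aggregations(aggs, path):
    """Persist the batch job's aggregations before the ETL job terminates."""
    with open(path, "w") as f:
        json.dump(aggs, f)

def load_aggregations(path):
    """Reload the aggregations when the streaming flow starts up."""
    with open(path) as f:
        return json.load(f)

# ETL job: compute aggregations, write them out, then exit.
aggs = {"2014-08-28": {"clicks": 1024, "users": 37}}
path = os.path.join(tempfile.gettempdir(), "daily_aggs.json")
save_aggregations(aggs, path)

# Streaming job: read them back at startup; in Spark you'd then
# broadcast the loaded dict so every executor has a read-only copy.
loaded = load_aggregations(path)
print(loaded["2014-08-28"]["clicks"])  # 1024
```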

For something more sophisticated, look at Redis <http://redis.io/> or a
distributed database: your ETL job can update that store on each run, and
your on-demand job can query it whenever it needs the pre-computed results,
with no restart required.
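To sketch that flow (a plain dict stands in for Redis here, since I can't
assume a running server; with the real thing you'd swap in a redis-py
client and its hash commands, and the key names below are made up):

```python
# A dict standing in for Redis or another shared store.
# Both "jobs" below are plain functions; in practice they'd be
# separate Spark applications talking to the same external store.
store = {}

def etl_update(store, day, aggregations):
    """Periodic batch ETL: overwrite the pre-computed aggregations."""
    store["aggs:" + day] = aggregations

def on_demand_query(store, day, metric):
    """On-demand job: read the latest aggregations for one metric."""
    return store.get("aggs:" + day, {}).get(metric)

etl_update(store, "2014-08-28", {"clicks": 1024, "users": 37})
print(on_demand_query(store, "2014-08-28", "users"))  # 37

# A later ETL run just overwrites the key; subsequent queries
# see the new values without the query job restarting.
etl_update(store, "2014-08-28", {"clicks": 2048, "users": 41})
print(on_demand_query(store, "2014-08-28", "users"))  # 41
```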


On Thu, Aug 28, 2014 at 4:30 PM, huylv <huy.le...@insight-centre.org> wrote:

> Hi,
>
> I'm building a system for near real-time data analytics. My plan is to have
> an ETL batch job which calculates aggregations running periodically. User
> queries are then parsed for on-demand calculations, also in Spark. Where
> are
> the pre-calculated results supposed to be saved? I mean, after finishing
> aggregations, the ETL job will terminate, so caches are wiped out of
> memory.
> How can I use these results to calculate on-demand queries? Or more
> generally, could you please give me a good way to organize the data flow
> and
> jobs in order to achieve this?
>
> I'm new to Spark so sorry if this might sound like a dumb question.
>
> Thank you.
> Huy
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Where-to-save-intermediate-results-tp13062.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
