Re: Where to save intermediate results?
I don't have any personal experience with Spark Streaming. Whether you store your data in HDFS or a database or something else probably depends on the nature of your use case.

On Fri, Aug 29, 2014 at 10:38 AM, huylv huy.le...@insight-centre.org wrote:
---
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io  W: www.velos.io
Re: Where to save intermediate results?
Hi Daniel,

Your suggestion is definitely an interesting approach. In fact, I already have another system to deal with the stream analytical processing part. So basically, the Spark job that aggregates data just accumulatively computes aggregations from the historical data together with each new batch, which has already been partly summarized by the stream processor. Answering queries involves combining pre-calculated historical data with on-stream aggregations. This sounds much like what Spark Streaming is intended to do, so I'll take a deeper look at Spark Streaming and consider porting the stream processing part to it.

Regarding saving pre-calculated data to external storage (disk, database...), I'm looking at Cassandra for now. But I don't know how well it fits my context, or how its performance compares to saving files in HDFS. Also, is there any way to keep the pre-calculated data both on disk and in memory, so that when the batch job terminates, the historical data is still available in memory for combining with the stream processor, while still being able to survive a system failure or upgrade? Not to mention that the pre-calculated data might get too big; in that case, keeping only the newest data in memory would be better. Tachyon looks like a nice option, but again, I don't have experience with it and it's still an experimental feature of Spark.

Regards,
Huy

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Where-to-save-intermediate-results-tp13062p13127.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
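[Editor's note: a minimal sketch of the accumulate-and-persist pattern the question describes, in plain Python rather than Spark APIs so the dataflow is visible. In Spark itself this would roughly correspond to persist(StorageLevel.MEMORY_AND_DISK) plus writing checkpoints to HDFS; the file path and function names here are purely illustrative.]

```python
import json
import os

STORE = "aggregates.json"  # illustrative stand-in for HDFS or Cassandra

def load_aggregates():
    """Reload historical aggregates after a restart, failure, or upgrade."""
    if os.path.exists(STORE):
        with open(STORE) as f:
            return json.load(f)
    return {}

def merge_batch(historical, batch_summary):
    """Accumulatively fold a pre-summarized batch into the historical totals."""
    for key, count in batch_summary.items():
        historical[key] = historical.get(key, 0) + count
    return historical

def save_aggregates(historical):
    """Write the merged result back out so it survives job termination."""
    with open(STORE, "w") as f:
        json.dump(historical, f)

# One ETL cycle: reload history, merge the stream processor's partial
# summary for the new batch, and persist the result for the next cycle.
hist = load_aggregates()
hist = merge_batch(hist, {"page_a": 10, "page_b": 3})
save_aggregates(hist)
```

The in-memory dict plays the role of the cached copy; the file plays the role of the durable copy. Keeping only the newest keys in memory (the size concern raised above) would just mean evicting old entries from the dict while leaving them in the store.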
Re: Where to save intermediate results?
I assume your on-demand calculations are a streaming flow? If your data aggregated from batch isn't too large, maybe you should just save it to disk; when your streaming flow starts, you can read the aggregations back from disk and perhaps just broadcast them. Though I guess you'd have to restart your streaming flow when these aggregations are updated.

For something more sophisticated, maybe look at Redis (http://redis.io/) or some distributed database? Your ETL can update that store, and your on-demand job can query it.

On Thu, Aug 28, 2014 at 4:30 PM, huylv huy.le...@insight-centre.org wrote:

Hi,

I'm building a system for near real-time data analytics. My plan is to have an ETL batch job which calculates aggregations running periodically. User queries are then parsed for on-demand calculations, also in Spark.

Where are the pre-calculated results supposed to be saved? I mean, after finishing the aggregations, the ETL job will terminate, so the caches are wiped out of memory. How can I use these results to calculate on-demand queries? Or more generally, could you please give me a good way to organize the data flow and jobs in order to achieve this? I'm new to Spark, so sorry if this might sound like a dumb question.

Thank you.
Huy

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Where-to-save-intermediate-results-tp13062.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
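[Editor's note: a sketch of the save-then-read-back flow suggested above, again in plain Python with a local file standing in for HDFS or Redis. In actual Spark, the query side would load the file into a map and hand it to sc.broadcast() so executors share one copy; the path and function names here are illustrative assumptions.]

```python
import json

AGG_PATH = "etl_output.json"  # illustrative stand-in for an HDFS path or Redis key

def etl_job(raw_records):
    """Periodic batch job: aggregate and write results out before terminating."""
    totals = {}
    for user, amount in raw_records:
        totals[user] = totals.get(user, 0) + amount
    with open(AGG_PATH, "w") as f:
        json.dump(totals, f)

def on_demand_query(user):
    """Query-time job: read the pre-calculated aggregates back and use them."""
    with open(AGG_PATH) as f:
        totals = json.load(f)  # in Spark: load once, then broadcast
    return totals.get(user, 0)

etl_job([("alice", 5), ("bob", 2), ("alice", 3)])
print(on_demand_query("alice"))  # → 8
```

Swapping the file for Redis would mean the ETL job writes keys into the store and the on-demand job reads them per query, which avoids restarting the streaming flow when aggregations are refreshed.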