Hi Daniel,

Your suggestion is definitely an interesting approach. In fact, I already have another system handling the stream-analytics part. Basically, the Spark batch job incrementally computes aggregations over the historical data together with each new batch, which has already been partially summarized by the stream processor. Answering a query then involves combining the pre-calculated historical results with the on-stream aggregations. This sounds much like what Spark Streaming is intended for, so I'll take a deeper look at Spark Streaming and consider porting the stream-processing part to it.
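For what it's worth, the per-key combine step I have in mind looks roughly like the sketch below. This is plain Python rather than real Spark code, and the names (`merge_counts`, `historical`, `new_batch`) are mine, just to illustrate merging pre-calculated aggregates with the latest partial aggregates; in Spark the same merge would happen at the RDD level, e.g. via `reduceByKey` or Spark Streaming's `updateStateByKey`.

```python
def merge_counts(historical, new_batch):
    """Combine pre-calculated historical aggregates with the partial
    aggregates produced for the newest batch (per-key sums)."""
    merged = dict(historical)          # start from the historical totals
    for key, count in new_batch.items():
        merged[key] = merged.get(key, 0) + count
    return merged

# Hypothetical data: totals from the batch job, plus the partial
# summary the stream processor produced for the most recent window.
historical = {"page_a": 100, "page_b": 40}
new_batch = {"page_a": 5, "page_c": 2}

combined = merge_counts(historical, new_batch)
print(combined)  # {'page_a': 105, 'page_b': 40, 'page_c': 2}
```

The point is just that the merge is associative, so it can run either in the batch job (folding a new batch into history) or at query time (folding on-stream aggregates into pre-calculated results).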
Regarding saving pre-calculated data to external storage (disk, a database, ...), I'm looking at Cassandra for now, but I don't know how well it fits my context or how its performance compares to saving files in HDFS. Also, is there any way to keep the pre-calculated data both on disk and in memory, so that when the batch job terminates the historical data is still available in memory for combining with the stream processor, while also being able to survive a system failure or upgrade? Not to mention that the pre-calculated data might grow too big; in that case it would be better to keep only the newest data in memory. Tachyon looks like a nice option, but again I have no experience with it, and it's still an experimental feature of Spark.

Regards,
Huy

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Where-to-save-intermediate-results-tp13062p13127.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.