Re: Where to save intermediate results?

2014-09-02 Thread Daniel Siegmann
I don't have any personal experience with Spark Streaming. Whether you
store your data in HDFS or a database or something else probably depends on
the nature of your use case.


On Fri, Aug 29, 2014 at 10:38 AM, huylv huy.le...@insight-centre.org
wrote:

 Hi Daniel,

 Your suggestion is definitely an interesting approach. In fact, I already
 have another system to handle the stream analytical processing part. So
 basically, the Spark job just accumulatively computes aggregations from
 historical data together with each new batch, which has already been partly
 summarized by the stream processor. Answering queries involves combining
 pre-calculated historical data with on-stream aggregations. This sounds much
 like what Spark Streaming is intended to do, so I'll take a deeper look at
 Spark Streaming and consider porting the stream processing part to it.

 Regarding saving pre-calculated data to external storage (disk,
 database...), I'm looking at Cassandra for now. But I don't know how it
 fits into my context, or how its performance compares to saving files in
 HDFS. Also, is there any way to keep the pre-calculated data both on disk
 and in memory, so that when the batch job terminates, the historical data
 is still available in memory for combining with the stream processor,
 while still being able to survive a system failure or upgrade? Not to
 mention the pre-calculated data might get too big; in that case, it would
 be better to keep only the newest data in memory. Tachyon looks like a
 nice option, but again, I don't have experience with it and it's still an
 experimental feature of Spark.

 Regards,
 Huy



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Where-to-save-intermediate-results-tp13062p13127.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Re: Where to save intermediate results?

2014-08-29 Thread huylv
Hi Daniel,

Your suggestion is definitely an interesting approach. In fact, I already
have another system to handle the stream analytical processing part. So
basically, the Spark job just accumulatively computes aggregations from
historical data together with each new batch, which has already been partly
summarized by the stream processor. Answering queries involves combining
pre-calculated historical data with on-stream aggregations. This sounds much
like what Spark Streaming is intended to do, so I'll take a deeper look at
Spark Streaming and consider porting the stream processing part to it.
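To make the accumulative scheme concrete, here is a minimal sketch outside Spark, with plain Python dicts standing in for the aggregation state; the function and key names are hypothetical:

```python
# Hypothetical sketch: historical totals are merged with partial aggregates
# that the stream processor has already summarized for the latest batch.

def merge_aggregates(historical, new_batch):
    """Combine per-key counts from history with a partly-summarized batch."""
    merged = dict(historical)
    for key, count in new_batch.items():
        merged[key] = merged.get(key, 0) + count
    return merged

def answer_query(historical, on_stream, key):
    """A query combines pre-calculated history with on-stream aggregations."""
    return historical.get(key, 0) + on_stream.get(key, 0)

historical = {"page_a": 100, "page_b": 40}
new_batch = {"page_a": 5, "page_c": 7}  # summarized by the stream processor

historical = merge_aggregates(historical, new_batch)
print(answer_query(historical, {"page_a": 2}, "page_a"))  # 107
```

In Spark the merge step would run over RDDs rather than dicts, but the shape of the computation is the same.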

Regarding saving pre-calculated data to external storage (disk,
database...), I'm looking at Cassandra for now. But I don't know how it
fits into my context, or how its performance compares to saving files in
HDFS. Also, is there any way to keep the pre-calculated data both on disk
and in memory, so that when the batch job terminates, the historical data
is still available in memory for combining with the stream processor, while
still being able to survive a system failure or upgrade? Not to mention the
pre-calculated data might get too big; in that case, it would be better to
keep only the newest data in memory. Tachyon looks like a nice option, but
again, I don't have experience with it and it's still an experimental
feature of Spark.
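The "both on disk and in memory" idea can be sketched independently of any particular store. This is not Cassandra or Tachyon code, just a plain-Python write-through cache over a JSON file, with hypothetical names:

```python
import json
import os
import tempfile

class WriteThroughStore:
    """Keeps aggregates in memory for fast combining with the stream
    processor, and writes every update through to disk so the data
    survives a restart or upgrade."""

    def __init__(self, path):
        self.path = path
        self.cache = {}
        if os.path.exists(path):
            with open(path) as f:
                self.cache = json.load(f)

    def put(self, key, value):
        self.cache[key] = value          # fast in-memory copy
        with open(self.path, "w") as f:  # durable on-disk copy
            json.dump(self.cache, f)

    def get(self, key, default=0):
        return self.cache.get(key, default)

path = os.path.join(tempfile.mkdtemp(), "aggregates.json")
store = WriteThroughStore(path)
store.put("page_a", 105)

# Simulate a restart: a new process re-reads the persisted aggregates.
recovered = WriteThroughStore(path)
print(recovered.get("page_a"))  # 105
```

A real deployment would replace the JSON file with HDFS, Cassandra, or Tachyon, but the write-through pattern is the same.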

Regards,
Huy






Re: Where to save intermediate results?

2014-08-28 Thread Daniel Siegmann
I assume your on-demand calculations are a streaming flow? If your data
aggregated from batch isn't too large, maybe you should just save it to
disk; when your streaming flow starts you can read the aggregations back
from disk and perhaps just broadcast them. Though I guess you'd have to
restart your streaming flow when these aggregations are updated.
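A minimal sketch of that save-then-reload pattern, in plain Python with a read-only mapping standing in for a Spark broadcast variable (the file name and function names are hypothetical):

```python
import json
import os
import tempfile
from types import MappingProxyType

AGG_FILE = os.path.join(tempfile.mkdtemp(), "batch_aggregates.json")

def batch_job():
    """The ETL batch job saves its aggregations to disk before terminating."""
    aggregates = {"page_a": 100, "page_b": 40}
    with open(AGG_FILE, "w") as f:
        json.dump(aggregates, f)

def start_streaming_flow():
    """At startup, the streaming flow reads the aggregations back and shares
    them as a read-only mapping (a stand-in for broadcasting them)."""
    with open(AGG_FILE) as f:
        return MappingProxyType(json.load(f))

batch_job()
broadcast_like = start_streaming_flow()
print(broadcast_like["page_a"])  # 100
# When the batch job rewrites the file, the streaming flow has to be
# restarted (or the broadcast re-created) to pick up the new values.
```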

For something more sophisticated, maybe look at Redis http://redis.io/ or
some distributed database? Your ETL can update that store, and your
on-demand job can query it.
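The update/query split might look like the following sketch, with an in-memory dict standing in for Redis or a distributed database (the key scheme is hypothetical, and a real setup would use an actual client library):

```python
# Stand-in for an external key-value store such as Redis: the ETL job
# writes aggregations into it, and the on-demand job only reads from it.
store = {}

def etl_update(batch_aggregates):
    """The ETL job accumulates its aggregations under per-key entries."""
    for key, value in batch_aggregates.items():
        store["agg:" + key] = store.get("agg:" + key, 0) + value

def on_demand_query(key):
    """The on-demand job reads whatever the ETL job has published so far."""
    return store.get("agg:" + key, 0)

etl_update({"page_a": 100})
etl_update({"page_a": 5, "page_b": 40})
print(on_demand_query("page_a"))  # 105
```

Because the store is external to both jobs, the ETL job can terminate freely and the on-demand job always sees the latest published aggregates.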


On Thu, Aug 28, 2014 at 4:30 PM, huylv huy.le...@insight-centre.org wrote:

 Hi,

 I'm building a system for near real-time data analytics. My plan is to have
 an ETL batch job that calculates aggregations, running periodically. User
 queries are then parsed for on-demand calculations, also in Spark. Where
 are the pre-calculated results supposed to be saved? I mean, after
 finishing the aggregations, the ETL job will terminate, so its caches are
 wiped out of memory. How can I use these results to answer on-demand
 queries? Or more generally, could you please suggest a good way to
 organize the data flow and jobs to achieve this?

 I'm new to Spark so sorry if this might sound like a dumb question.

 Thank you.
 Huy



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Where-to-save-intermediate-results-tp13062.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io