Re: Where to save intermediate results?
I don't have any personal experience with Spark Streaming. Whether you store your data in HDFS or a database or something else probably depends on the nature of your use case.

On Fri, Aug 29, 2014 at 10:38 AM, huylv huy.le...@insight-centre.org wrote:
---
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io  W: www.velos.io
Re: Where to save intermediate results?
Hi Daniel,

Your suggestion is definitely an interesting approach. In fact, I already have another system to deal with the stream analytical processing part. So basically, the Spark job that aggregates data just accumulatively computes aggregations from the historical data together with each new batch, which has already been partly summarized by the stream processor. Answering queries involves combining pre-calculated historical data with on-stream aggregations. This sounds much like what Spark Streaming is intended to do, so I'll take a deeper look at Spark Streaming and consider porting the stream processing part to it.

Regarding saving pre-calculated data to external storage (disk, database...), I'm looking at Cassandra for now. But I don't know how well it fits my context, or how its performance compares to saving files in HDFS. Also, is there any way to keep the pre-calculated data both on disk and in memory, so that when the batch job terminates, the historical data is still available in memory for combining with the stream processor, while still being able to survive a system failure or upgrade? Not to mention that the pre-calculated data might get too big; in that case, keeping only the newest data in memory would be better. Tachyon looks like a nice option, but again, I don't have experience with it and it's still an experimental feature of Spark.

Regards,
Huy

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Where-to-save-intermediate-results-tp13062p13127.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
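[Editor's note: a minimal sketch of the accumulate-and-persist pattern the question describes, in plain Python rather than Spark APIs so the dataflow is visible. In Spark itself this would roughly correspond to persist(StorageLevel.MEMORY_AND_DISK) plus writing checkpoints to HDFS; the file path and function names here are purely illustrative.]

```python
import json
import os

STORE = "aggregates.json"  # illustrative stand-in for HDFS or Cassandra

def load_aggregates():
    """Reload historical aggregates after a restart, failure, or upgrade."""
    if os.path.exists(STORE):
        with open(STORE) as f:
            return json.load(f)
    return {}

def merge_batch(historical, batch_summary):
    """Accumulatively fold a pre-summarized batch into the historical totals."""
    for key, count in batch_summary.items():
        historical[key] = historical.get(key, 0) + count
    return historical

def save_aggregates(historical):
    """Write the merged result back out so it survives job termination."""
    with open(STORE, "w") as f:
        json.dump(historical, f)

# One ETL cycle: reload history, merge the stream processor's partial
# summary for the new batch, and persist the result for the next cycle.
hist = load_aggregates()
hist = merge_batch(hist, {"page_a": 10, "page_b": 3})
save_aggregates(hist)
```

The in-memory dict plays the role of the cached copy; the file plays the role of the durable copy. Keeping only the newest keys in memory (the size concern raised above) would just mean evicting old entries from the dict while leaving them in the store.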
Re: Where to save intermediate results?
I assume your on-demand calculations are a streaming flow? If your data aggregated from batch isn't too large, maybe you should just save it to disk; when your streaming flow starts, you can read the aggregations back from disk and perhaps just broadcast them. Though I guess you'd have to restart your streaming flow when these aggregations are updated.

For something more sophisticated, maybe look at Redis (http://redis.io/) or some distributed database? Your ETL can update that store, and your on-demand job can query it.

On Thu, Aug 28, 2014 at 4:30 PM, huylv huy.le...@insight-centre.org wrote:

Hi,

I'm building a system for near real-time data analytics. My plan is to have an ETL batch job which calculates aggregations running periodically. User queries are then parsed for on-demand calculations, also in Spark.

Where are the pre-calculated results supposed to be saved? I mean, after finishing the aggregations, the ETL job will terminate, so the caches are wiped out of memory. How can I use these results to calculate on-demand queries? Or more generally, could you please give me a good way to organize the data flow and jobs in order to achieve this? I'm new to Spark, so sorry if this might sound like a dumb question.

Thank you.
Huy

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Where-to-save-intermediate-results-tp13062.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
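[Editor's note: a sketch of the save-then-read-back flow suggested above, again in plain Python with a local file standing in for HDFS or Redis. In actual Spark, the query side would load the file into a map and hand it to sc.broadcast() so executors share one copy; the path and function names here are illustrative assumptions.]

```python
import json

AGG_PATH = "etl_output.json"  # illustrative stand-in for an HDFS path or Redis key

def etl_job(raw_records):
    """Periodic batch job: aggregate and write results out before terminating."""
    totals = {}
    for user, amount in raw_records:
        totals[user] = totals.get(user, 0) + amount
    with open(AGG_PATH, "w") as f:
        json.dump(totals, f)

def on_demand_query(user):
    """Query-time job: read the pre-calculated aggregates back and use them."""
    with open(AGG_PATH) as f:
        totals = json.load(f)  # in Spark: load once, then broadcast
    return totals.get(user, 0)

etl_job([("alice", 5), ("bob", 2), ("alice", 3)])
print(on_demand_query("alice"))  # → 8
```

Swapping the file for Redis would mean the ETL job writes keys into the store and the on-demand job reads them per query, which avoids restarting the streaming flow when aggregations are refreshed.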