Re: Cassandra for Analytics?

2014-12-18 Thread Ryan Svihla
I'd dispute the claim that Cassandra has higher read latency than HBase.
I'm not sure what experience you have with both, and it may have been true
at one point, but with Leveled Compaction Strategy and proper JVM tuning I
don't see how it is still true; it should at least be comparable. I've
worked with clusters configured to serve reads from the buffer cache where
the 99th percentile read latency is under 400 microseconds.

Spark combined with Cassandra is a common fit for real-time analytics, and
Ooyala has been doing this for some time. There are a number of YouTube
videos where they talk about it:
https://www.youtube.com/watch?v=PjZp7K5z7ew

On Wed, Dec 17, 2014 at 10:20 PM, Ajay ajay.ga...@gmail.com wrote:

 Hi,

 Can Cassandra be used for, or is it a good fit for, real-time analytics? I
 went through a couple of benchmarks comparing Cassandra and HBase (most of
 them done 3 years ago), and they said that Cassandra is designed for
 intensive writes and has higher read latency than HBase. In our case, we
 will have both writes and reads, with more reads (say 40% writes and 60%
 reads). We are planning to use Spark as the in-memory computation engine.

 Thanks
 Ajay



-- 

http://www.datastax.com/

Ryan Svihla

Solution Architect

https://twitter.com/foundev
http://www.linkedin.com/pub/ryan-svihla/12/621/727/

DataStax is the fastest, most scalable distributed database technology,
delivering Apache Cassandra to the world’s most innovative enterprises.
DataStax is built to be agile, always-on, and predictably scalable to any
size. With more than 500 customers in 45 countries, DataStax is the
database technology and transactional backbone of choice for the world’s
most innovative companies such as Netflix, Adobe, Intuit, and eBay.


Re: Cassandra for Analytics?

2014-12-18 Thread Peter Lin
That depends on what you mean by real-time analytics.

For things like continuous data streams, neither is an appropriate platform
for doing the analytics itself. Both are good for storing the results (aka
output) of the streaming analytics. Before you decide between Cassandra and
HBase, I would suggest first figuring out exactly what kind of analytics
you need to do. Start with prototyping and look at what kinds of queries
and patterns you need to support.

Neither HBase nor Cassandra is good for complex patterns that do joins or
cross joins (aka MDX), so with either one you have to reinvent that
functionality yourself.

Most of the event processing and stream processing products out there also
don't support joins or cross joins very well, so any solution is going to
need several different components. Typically, stream processing does the
filtering, which feeds another system that does simple joins. The output of
that second step can then go to another system that does MDX-style queries.
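
As a rough illustration of that staged layout, here is a minimal sketch in
plain Python (standard library only). It is not tied to any product named
in this thread; the event fields, the lookup table, and the function names
are all hypothetical, and each function merely stands in for what would be
a separate system in a real deployment.

```python
from collections import defaultdict

# Hypothetical events and user lookup table (illustrative data only).
EVENTS = [
    {"user_id": 1, "action": "view", "amount": 0.0},
    {"user_id": 2, "action": "purchase", "amount": 30.0},
    {"user_id": 1, "action": "purchase", "amount": 12.5},
    {"user_id": 3, "action": "heartbeat", "amount": 0.0},
    {"user_id": 2, "action": "purchase", "amount": 7.5},
]
USERS = {1: "EU", 2: "US", 3: "US"}


def filter_stage(events):
    """Stage 1: drop noise so later stages only see purchases."""
    return (e for e in events if e["action"] == "purchase")


def join_stage(events, users):
    """Stage 2: a simple join that adds the user's region to each event."""
    for e in events:
        region = users.get(e["user_id"])
        if region is not None:
            yield {**e, "region": region}


def aggregate_stage(events):
    """Stage 3: a per-region summary, standing in for a richer MDX-style query."""
    totals = defaultdict(float)
    for e in events:
        totals[e["region"]] += e["amount"]
    return dict(totals)


def run_pipeline(events, users):
    """Chain the three stages: filter, then join, then summarize."""
    return aggregate_stage(join_stage(filter_stage(events), users))
```

Calling run_pipeline(EVENTS, USERS) sums the purchase amounts per region.
Keeping the stages separate makes it easier to see which step could be
handed to a stream processor and which to the store.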

Spark Streaming has basic support, but it's not as mature and feature-rich
as other stream processing products.


Re: Cassandra for Analytics?

2014-12-18 Thread Ryan Svihla
Since Ajay is already using Spark, the Spark Cassandra Connector gets them
where they want to be pretty easily (joins, etc.):
https://github.com/datastax/spark-cassandra-connector

As for Spark Streaming having only basic support, I'd challenge that
assertion (namely, Storm has a number of problems with delivery guarantees
that Spark basically solves). However, this isn't a Spark mailing list, and
perhaps that conversation is better had there.

If the question is "Is Cassandra used in real-time analytics cases with
Spark?", the answer is absolutely yes (and with Storm, for that matter). If
the question is "Can you do your analytics queries on Cassandra while you
have Spark sitting there doing nothing?", then of course the answer is no,
but that would be a bizarre question, since they already have Spark in use.


Re: Cassandra for Analytics?

2014-12-18 Thread Peter Lin
Some of the most common use cases in stream processing are sliding windows
based on time or count. Based on my understanding of the Spark architecture
and Spark Streaming, it does not provide the same functionality. One can
fake it by setting Spark Streaming to really small micro-batches, but
that's not the same.
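
To make the distinction concrete, here is a minimal sketch in plain Python
(standard library only, not Spark code). It compares a time-based sliding
window that is re-evaluated on every event with a window that only advances
once per batch interval. The function names, the integer millisecond
timestamps, and the parameters are assumptions made for illustration, and
timestamps are assumed to arrive in ascending order.

```python
from collections import deque


def sliding_counts(timestamps, window):
    """Per-event sliding window: after each event at time t, count the
    events that fall in the interval (t - window, t]."""
    recent = deque()
    counts = []
    for t in timestamps:
        recent.append(t)
        # Evict events that have fallen out of the window ending at t.
        while recent[0] <= t - window:
            recent.popleft()
        counts.append(len(recent))
    return counts


def micro_batch_counts(timestamps, batch_interval, batches_per_window):
    """Micro-batch approximation: events are bucketed by batch interval and
    the window advances one whole batch at a time."""
    if not timestamps:
        return []
    n_batches = int(max(timestamps) // batch_interval) + 1
    per_batch = [0] * n_batches
    for t in timestamps:
        per_batch[int(t // batch_interval)] += 1
    # Each result covers the current batch plus the preceding ones in the window.
    return [
        sum(per_batch[max(0, i - batches_per_window + 1): i + 1])
        for i in range(n_batches)
    ]
```

With the same one-second window over the same events, the per-event version
can observe a burst that straddles batch boundaries, while the batched
version only sees totals aligned to the batch grid. Shrinking the batch
interval narrows that gap at the cost of more frequent, smaller batches.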

If the use case fits that model, then using Spark is fine. For other kinds
of use cases, Spark may not be a good fit. Some people store all events
before analyzing them, which works for some use cases. For other use cases,
like trading systems, storing before analysis isn't feasible or practical.
Other use cases, like command and control, also don't fit the
store-before-analysis model.

Try to avoid putting the cart in front of the horse. Picking a tool before
you have a clear understanding of the problem is a good recipe for disaster.

Re: Cassandra for Analytics?

2014-12-18 Thread Ryan Svihla
I'll decline to continue the commentary on Spark, as again this probably
belongs on another list, other than to say that micro-batching is an
intentional design tradeoff that has notable benefits for the same use
cases you're referring to. While you may disagree with those tradeoffs,
it's a bit harsh to dismiss as basic something that was chosen deliberately
and provides some improvements over, say, the Storm model.


Re: Cassandra for Analytics?

2014-12-18 Thread Peter Lin
For the record, I think Spark is good and I'm glad we have options.

My point wasn't to bad-mouth Spark. I'm not comparing Spark to Storm at
all, so I think there's some confusion here. I'm thinking of Esper,
StreamBase, and other stream processing products. My point is to think
about the problems that need to be solved before picking a solution. Like
everyone else, I've been guilty of this in the past, so it's not propaganda
for or against any specific product.

I've seen customers use IBM InfoSphere Streams when something like Storm
or Spark would work, but I've also seen cases where open source doesn't
provide equivalent functionality. If Spark meets the needs, then either
HBase or Cassandra will probably work fine. The bigger question is what
patterns you use in the architecture. Do you store the data first before
doing analysis? Is the data noisy, so that it needs filtering before
persistence? What kinds of patterns, queries, and operations are needed?

Having worked on trading systems and other real-time use cases, I can say
that not all stream processing is the same.


Re: Cassandra for Analytics?

2014-12-18 Thread Ryan Svihla
My mistake on Storm, and I'm certain there are a number of use cases where
you're right that Spark isn't the right answer, but I'd argue you're
treating it as if it had the Spark 0.5 feature set instead of Spark 1.1's.

As for filtering before persistence, this is the common use case for Spark
Streaming, and I've helped a number of enterprise customers do this very
thing (fraud detection using windows of various sizes, live aggregation of
data, and joins), typically pulling from a Kafka topic, but it can be
adapted to pretty much any source.
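
As a toy illustration of windowed filtering before persistence, here is a
minimal sketch in plain Python (standard library only). It is not Spark
Streaming code and does not reflect any customer's implementation; the
function name, the (card, timestamp) tuples, and the threshold rule are
hypothetical. Transactions are assumed to arrive in timestamp order, and
only the flagged records would be handed on to the store.

```python
from collections import defaultdict, deque


def flag_bursts(transactions, window, max_per_window):
    """Yield (card, timestamp) whenever a card has more than max_per_window
    transactions inside the trailing window; everything else is dropped
    before it reaches the persistence layer."""
    recent = defaultdict(deque)  # card -> timestamps still inside the window
    for card, ts in transactions:
        q = recent[card]
        q.append(ts)
        # Evict this card's timestamps that are older than the window.
        while q[0] <= ts - window:
            q.popleft()
        if len(q) > max_per_window:
            yield (card, ts)
```

For example, list(flag_bursts(txns, window=10, max_per_window=2)) keeps
only the transactions that pushed a card past two events in ten time units.
The window size and the threshold are the knobs one would vary per use case.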

I'd argue you were correct about everything at one time, but you're saying
it can't do things it's been doing in production for a while now.



Re: Cassandra for Analytics?

2014-12-18 Thread Ajay
Thanks Ryan and Peter for the suggestions.

Our requirement (we are an e-commerce company) at a high level is to build
a data warehouse as a platform or service (for different product teams to
consume), as below:

Data warehouse as a platform/service
  |
Spark SQL
  |
Spark in-memory computation engine (we were considering Drill/Flink, but
Spark is more mature and in production)
  |
Cassandra/HBase (yet to be decided; aggregated views plus data written
directly to it, so 40-50% writes and 50-60% reads)
  |
Stream processing (Spark Streaming or Storm, yet to be decided;
Spark Streaming is relatively new)
  |
MySQL/Mongo/real-time data

Since we are planning to build it as a service, we cannot assume any
particular data access pattern.

Thanks
Ajay



Re: Cassandra for Analytics?

2014-12-18 Thread Peter Lin
In the interest of knowledge sharing on the general topic of stream
processing: the domain is quite old and there's a lot of existing
literature.

Within this space there are several important factors that many products
don't address:

temporal windows (sliding windows, discrete windows, dynamic windows) -
  most support the first two, but support dynamic windows poorly
temporal validity - for how long is the data valid? - most don't support
  this
temporal patterns - patterns that are valid for a finite amount of time -
  most don't support this as a first-class concept
temporal data types - machine learning systems that can create new data
  types - most don't support this
temporal distance - the maximum time-to-live for a specific piece of data -
  most don't support this
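
To illustrate one item from the list, temporal validity, here is a minimal
sketch in plain Python (standard library only). It is one possible reading
of the idea, not how any product mentioned here implements it; the class
name, method names, and the explicit "now" argument (used instead of a real
clock so the behavior is deterministic) are all hypothetical.

```python
class ExpiringFacts:
    """Keeps each fact only for its stated validity period
    (illustrative helper, not a product API)."""

    def __init__(self):
        self._facts = {}  # key -> (value, expires_at)

    def put(self, key, value, now, valid_for):
        """Record a fact that is valid from `now` for `valid_for` time units."""
        self._facts[key] = (value, now + valid_for)

    def get(self, key, now):
        """Return the fact if it is still valid at `now`, otherwise None."""
        entry = self._facts.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            # The fact has outlived its validity period; drop it lazily.
            del self._facts[key]
            return None
        return value
```

A caller would put a value with a validity period and read it back with the
current time; once the period has elapsed the read returns None. Attaching
the validity to each fact, rather than to the whole window, is the design
choice being illustrated.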

Having studied many stream processing products, most focus on simple
queries on 1 tuple (aka object type) and basic joining of streams. A tuple
here is basically equivalent to 1 table. Some stream products let you
materialize views (aka projections) like summary tables, but most do not
let you define an in-memory cube to make complex queries easier. For the
most part, the developer has to mentally break down the queries into
multiple pieces and do it manually.
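
As a rough illustration of what a materialized view (summary table) over a single tuple type looks like, here is a small Python sketch that maintains a per-key count and sum incrementally as events arrive. The field names (`symbol`, `price`) and the `SummaryView` class are hypothetical and are not the API of any stream product.

```python
from collections import defaultdict


class SummaryView:
    """Incrementally maintained per-key count and sum (a tiny summary table)."""

    def __init__(self, key_field, value_field):
        self.key_field = key_field
        self.value_field = value_field
        self.rows = defaultdict(lambda: {"count": 0, "total": 0.0})

    def on_event(self, event):
        # Update the one affected row instead of recomputing from raw events.
        row = self.rows[event[self.key_field]]
        row["count"] += 1
        row["total"] += event[self.value_field]

    def average(self, key):
        row = self.rows.get(key)
        if row is None:
            return None
        return row["total"] / row["count"]
```

Anything beyond one key and one measure (several dimensions, joins across views) is where the manual breakdown described above starts.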

With most products, it's possible to hack together something that looks
like an mdx query, but the level of effort differs. Even then, the bigger
question is the overall architecture. Once the use case is known, it's much
easier to decide what needs to be filtered before persistence and what
needs to be summarized before persistence.
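
A minimal sketch of the "filter, then summarize, then persist" ordering, in plain Python. The event fields (`ts`, `user`, `amount`), the noise threshold, and the list used as a sink are all assumptions for illustration; in a real deployment the sink would be a write to Cassandra or HBase.

```python
from collections import defaultdict


def filter_noise(events, min_amount):
    """Drop events that should never reach the store."""
    return (e for e in events if e["amount"] >= min_amount)


def summarize_per_minute(events):
    """Collapse raw events into one row per (minute, user)."""
    totals = defaultdict(float)
    for e in events:
        minute = e["ts"] // 60
        totals[(minute, e["user"])] += e["amount"]
    return [
        {"minute": minute, "user": user, "total": total}
        for (minute, user), total in sorted(totals.items())
    ]


def persist(rows, sink):
    """Stand-in for a database write; here it only appends to a list."""
    sink.extend(rows)
    return len(rows)
```

Usage would be `persist(summarize_per_minute(filter_noise(events, 1.0)), sink)`, so the store only ever sees the summarized rows.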

peter

On Thu, Dec 18, 2014 at 8:51 AM, Ryan Svihla rsvi...@datastax.com wrote:

 My mistake on Storm, and I'm certain there are a number of use cases where
 you're right that Spark isn't the right answer, but I'd argue you're
 treating it like Spark 0.5 feature-wise instead of Spark 1.1.

 As for filtering before persistence, this is the common use case for Spark
 Streaming and I've helped a number of enterprise customers do this very
 thing (fraud using windows of various sizes, live aggregation of data, and
 joins), typically pulling from a Kafka topic, but it can be adapted to
 pretty much any source.

 I'd argue you were correct about everything at one time, but you're saying
 it can't do things it's been doing in production for a while now.


 On Thu, Dec 18, 2014 at 7:30 AM, Peter Lin wool...@gmail.com wrote:


 for the record I think spark is good and I'm glad we have options.

 my point wasn't to bad-mouth spark. I'm not comparing spark to storm at
 all, so I think there's some confusion here. I'm thinking of Esper,
 StreamBase, and other stream processing products. My point is to think
 about the problems that need to be solved before picking a solution. Like
 everyone else, I've been guilty of this in the past, so it's not propaganda
 for or against any specific product.

 I've seen customers use IBM InfoSphere Streams when something like storm
 or spark would work, but I've also seen cases where open source doesn't
 provide equivalent functionality. If spark meets the needs, then either
 hbase or cassandra will probably work fine. The bigger question is what
 patterns do you use in the architecture? Do you store the data first before
 doing analysis? Is the data noisy and in need of filtering before
 persistence? What kinds of patterns/queries and operations are needed?

 having worked on trading systems and other real-time use cases, not all
 stream processing is the same.

 On Thu, Dec 18, 2014 at 8:18 AM, Ryan Svihla rsvi...@datastax.com
 wrote:

 I'll decline to continue the commentary on spark, as again this probably
 belongs on another list, other than to say that microbatching is an
 intentional design tradeoff that has notable benefits for the same use cases
 you're referring to, and that while you may disagree with those tradeoffs,
 it's a bit harsh to dismiss as basic something that was chosen and provides
 some improvements over, say, the Storm model.

 On Thu, Dec 18, 2014 at 7:13 AM, Peter Lin wool...@gmail.com wrote:


 some of the most common types of use cases in stream processing are
 sliding windows based on time or count. Based on my understanding of spark
 architecture and spark streaming, it does not provide the same
 functionality. One can fake it by setting spark streaming to really small
 micro-batches, but that's not the same.

 if the use case fits that model, then using spark is fine. For other
 kinds of use cases, spark may not be a good fit. Some people store all
 events before analyzing them, which works for some use cases. For other
 use cases like trading systems, store-before-analysis isn't feasible or
 practical. Other use cases like command and control also don't fit the
 store-before-analysis model.

 Try to avoid putting the cart in front of the horse. Picking a tool
 before you have a clear understanding of the problem is a good recipe for
 disaster.

 On Thu, Dec 18, 2014 at 8:04 AM, Ryan Svihla rsvi...@datastax.com
 wrote:

 Since 

Re: Cassandra for Analytics?

2014-12-18 Thread Peter Lin
by data warehouse, what kind do you mean?

is it the traditional warehouse where people create multi-dimensional cubes?
or is it the newer class of UI tools that makes it easier for users to
explore data and the warehouse is mostly a denormalized (ie flattened)
format of the OLTP?
or is it a combination of both?

from my experience, the biggest challenge of data warehousing isn't storing
the data. It's making it easy to explore for ad hoc mdx-like queries. In the
old days, the DBAs would define the cubes, write the ETL routines and let
the data load for days/weeks. In the new nosql model, you can avoid the
cube + ETL phase, but discovering the data and understanding the format
still requires a developer.

getting the data into a user-friendly format like a cube with Spark
still requires a developer. I find that business users hate to go to the
developer, because we tend to ask "what are the functional specs?" Most of
the time business users don't know, they just want to explore. At that
point, the storage engine largely doesn't matter to the end user. It
matters to the developers, but business users don't care.

based on the description, I would watch out for how many aggregated views
the platform creates. search the mailing list to see past discussions on
the maximum recommended number of column families.

where the classic data warehouse caused lots of pain is in creating cubes.
Any general solution attempting to replace/supplement existing products
needs to make it easy and trivial to define ad hoc cubes and then query
against them. There are existing products that already connect to a few
nosql databases for data exploration. hope that helps
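
To show why the number of aggregated views is worth watching, here is a small Python sketch of an ad hoc cube: it sums one measure for every subset of the chosen dimensions, which yields 2**n groupings for n dimensions. The function name and the row layout are assumptions for illustration; if each grouping were stored as its own aggregated view (column family), the count would grow just as quickly.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions` (2**n groupings)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube
```

With two dimensions this produces 4 groupings; with ten dimensions it would already be 1024.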

peter



On Thu, Dec 18, 2014 at 9:01 AM, Ajay ajay.ga...@gmail.com wrote:

 Thanks Ryan and Peter for the suggestions.

 Our requirement (as an ecommerce company) at a high level is to build a
 data warehouse as a platform or service (for different product teams to
 consume), as below:

 Datawarehouse as a platform/service
  |
 Spark SQL
  |
 Spark in memory computation engine (We were considering Drill/Flink but
 Spark is more mature and in production)
  |
 Cassandra/HBase (Yet to be decided. Aggregated views + data
 directly written to this. So 40%-50% writes, 50%-60% reads)
  |
 Streaming processing (Spark Streaming or Storm. Yet to be decided.
 Spark streaming is relatively new)
  |
 MySQL/Mongo/Real Time data

 Since we are planning to build it as a service, we cannot assume a
 particular data access pattern.

 Thanks
 Ajay



Re: Cassandra for Analytics?

2014-12-18 Thread Ajay
Hi Peter,

You are right. The idea is to directly query the data from NoSQL, in our
case via Spark SQL on Spark (since Spark largely supports
Mongo/Cassandra/HBase/Hadoop). As you said, the business users still need
to query using Spark SQL. We are already using NoSQL BI tools like Pentaho
(which also plans to support Spark SQL soon). The idea is to abstract the
storage solutions (more than one: Cassandra/HBase/Mongo) away from the
business users.
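
A rough Python sketch of the abstraction idea: the query layer talks to a catalog of logical table names and never sees which backend holds the data. `Store`, `InMemoryStore`, and `Catalog` are hypothetical names, and the in-memory store only stands in for a real Cassandra, HBase, or Mongo connector.

```python
from abc import ABC, abstractmethod


class Store(ABC):
    """What the query layer is allowed to know about a backend."""

    @abstractmethod
    def scan(self, table):
        """Return an iterator over the rows (dicts) of `table`."""


class InMemoryStore(Store):
    """Stand-in for a real backend such as Cassandra, HBase or Mongo."""

    def __init__(self, tables):
        self.tables = tables

    def scan(self, table):
        return iter(self.tables.get(table, []))


class Catalog:
    """Routes a logical table name to whichever store holds it."""

    def __init__(self):
        self.routes = {}

    def register(self, table, store):
        self.routes[table] = store

    def query(self, table, predicate=lambda row: True):
        return [row for row in self.routes[table].scan(table) if predicate(row)]
```

Moving a table to another backend then only changes the `register` call, not the queries written against the catalog.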

Thanks
Ajay


Re: Cassandra for Analytics?

2014-12-18 Thread Colin
Almost every stream processing system I know of offers joins out of the box
and has done so for years.

Even open source offerings like Esper have offered joins for years.

The ones that haven't are systems like storm, spark, etc., which I don't
really classify as stream processors anyway.
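
For readers unfamiliar with what a stream join involves, here is a minimal Python sketch of a time-windowed join between two streams on a shared key. It is an illustration only, not how Esper or any other product implements joins; the `WindowedJoin` class is hypothetical and it assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque


class WindowedJoin:
    """Pairs 'left' and 'right' events that share a key within `window` seconds."""

    def __init__(self, window):
        self.window = window
        self.buffers = {"left": defaultdict(deque), "right": defaultdict(deque)}

    def on_event(self, side, key, timestamp, payload):
        other = "right" if side == "left" else "left"
        candidates = self.buffers[other][key]
        # Expire partners that are too old to match this or any later event.
        while candidates and candidates[0][0] < timestamp - self.window:
            candidates.popleft()
        matches = []
        for _, partner in candidates:
            pair = (payload, partner) if side == "left" else (partner, payload)
            matches.append(pair)
        # Remember this event so later events on the other side can match it.
        self.buffers[side][key].append((timestamp, payload))
        return matches
```

Each call returns the (left, right) pairs completed by the new event, so a caller can emit joined tuples as they form.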



--
Colin Clark 
+1-320-221-9531
 

 On Dec 18, 2014, at 1:52 PM, Peter Lin wool...@gmail.com wrote:
 
 that depends on what you mean by real-time analytics.
 
 For things like continuous data streams, neither is an appropriate platform
 for doing analytics. They're good for storing the results (aka output) of the
 streaming analytics. I would suggest that before you decide cassandra vs hbase,
 you first figure out exactly what kind of analytics you need to do. Start with
 prototyping and look at what kind of queries and patterns you need to support.
 
 neither hbase nor cassandra is good for complex patterns that do joins or
 cross joins (aka mdx), so using either one you have to re-invent stuff.
 
 most of the event processing and stream processing products out there also 
 don't support joins or cross joins very well, so any solution is going to 
 need several different components. typically stream processing does 
 filtering, which feeds another system that does simple joins. The output of 
 the second step can then go to another system that does mdx style queries.
 
 spark streaming has basic support, but it's not as mature and feature rich as 
 other stream processing products.
 
 On Wed, Dec 17, 2014 at 11:20 PM, Ajay ajay.ga...@gmail.com wrote:
 Hi,
 
 Can Cassandra be used for, or is it a good fit for, real-time analytics? I went
 through a couple of benchmarks of Cassandra vs HBase (most of them were done 3
 years ago), and they mentioned that Cassandra is designed for intensive writes
 and has higher read latency than HBase. In our case, we will have both writes
 and reads, with more reads (say 40% writes and 60% reads). We are planning to
 use Spark as the in-memory computation engine.
 
 Thanks
 Ajay


Re: Cassandra for Analytics?

2014-12-18 Thread Peter Lin
@Colin -
I bounce back and forth on classifying storm and spark as stream processing
frameworks. Clearly they are marketed as stream processing frameworks and
they can process data streams. Even with the commercial stream processing
products, expressing joins with some of the products is a bit quirky, to
put it nicely. The StreamSQL-based products tend to be easier for end
users to grok, but it's still not an ideal way of expressing temporal
patterns and temporal queries.
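
As an example of the kind of temporal pattern that is awkward to express in SQL-like syntax, here is a small Python sketch that detects "event A followed by event B within N seconds". The function name and event kinds are made up for illustration, and it assumes events arrive in timestamp order.

```python
def detect_followed_by(events, first, second, within):
    """Return (t_first, t_second) pairs where `second` follows `first` in time.

    `events` is an iterable of (timestamp, kind) tuples in timestamp order;
    a `first` event can only be matched for `within` seconds after it occurs.
    """
    pending = []  # timestamps of `first` events that are still valid
    hits = []
    for timestamp, kind in events:
        # A pending `first` is only valid for a finite amount of time.
        pending = [t for t in pending if timestamp - t <= within]
        if kind == first:
            pending.append(timestamp)
        elif kind == second:
            hits.extend((t, timestamp) for t in pending)
            pending = []  # each `first` is consumed by at most one `second`
    return hits
```

For instance, `detect_followed_by(events, "login_failed", "password_change", 60)` would only report a password change that happens within a minute of a failed login.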


that's the reason I always tell our customers "figure out your use case
first", though most of them respond with "we don't know the use case, but we
know we want to use it".

