Re: Spark and intermediate results

2015-10-09 Thread Marcelo Valle (BLOOMBERG/ LONDON)
I know the connector, but having the connector only means it will take *input* 
data from Cassandra, right? What about intermediate results?
If it stores intermediate results in Cassandra, could you please clarify how 
data locality is handled? Will it store them in another keyspace? 
I could not find any doc about it...

From: user@cassandra.apache.org 
Subject: Re: Spark and intermediate results

You can run spark against your Cassandra data directly without using a shared 
filesystem. 

https://github.com/datastax/spark-cassandra-connector
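As an illustration only (keyspace, table and column names below are made up, and 
the exact API can differ between connector versions), a minimal Spark job that 
reads CF A with the connector, transforms it, and writes the result to CF B could 
look roughly like this:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object CopyCfAToCfB {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cf-a-to-cf-b")
      .set("spark.cassandra.connection.host", "10.0.0.1") // any Cassandra contact point
    val sc = new SparkContext(conf)

    // Read CF A as an RDD of CassandraRow, transform each row, write the result to CF B.
    sc.cassandraTable("my_keyspace", "cf_a")
      .map(row => (row.getString("id"), row.getInt("value") * 2)) // example transformation
      .saveToCassandra("my_keyspace", "cf_b", SomeColumns("id", "doubled_value"))

    sc.stop()
  }
}

Intermediate shuffle data for a job like this stays on the Spark workers' local 
disks, not in Cassandra, which is why no shared filesystem is needed.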


On Fri, Oct 9, 2015 at 6:09 AM Marcelo Valle (BLOOMBERG/ LONDON) 
<mvallemil...@bloomberg.net> wrote:

Hello, 

I saw this nice link from an event:

http://www.datastax.com/dev/blog/zen-art-spark-maintenance?mkt_tok=3RkMMJWWfF9wsRogvqzIZKXonjHpfsX56%2B8uX6GylMI%2F0ER3fOvrPUfGjI4GTcdmI%2BSLDwEYGJlv6SgFSrXMMblswLgIXBY%3D

I would like to test using Spark to perform some operations on a column family; 
my objective is to read from CF A and write the output of my M/R job to CF B. 

That said, I've read this from Spark's FAQ (http://spark.apache.org/faq.html):

"Do I need Hadoop to run Spark?
No, but if you run on a cluster, you will need some form of shared file system 
(for example, NFS mounted at the same path on each node). If you have this type 
of filesystem, you can just deploy Spark in standalone mode."

The question I ask is - if I don't want to have an HDFS installation just to run 
Spark on Cassandra, is my only option to have this NFS mounted over the network? 
It doesn't seem smart to me to have something like NFS to store Spark files, as 
it would probably affect performance, and at the same time I wouldn't like to 
have an additional HDFS cluster just to run jobs on Cassandra. 
Is there a way of using Cassandra itself as this "some form of shared file 
system"?

-Marcelo


<< ideas don't deserve respect >>


<< ideas dont deserve respect >>

Spark and intermediate results

2015-10-09 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Hello, 

I saw this nice link from an event:

http://www.datastax.com/dev/blog/zen-art-spark-maintenance?mkt_tok=3RkMMJWWfF9wsRogvqzIZKXonjHpfsX56%2B8uX6GylMI%2F0ER3fOvrPUfGjI4GTcdmI%2BSLDwEYGJlv6SgFSrXMMblswLgIXBY%3D

I would like to test using Spark to perform some operations on a column family; 
my objective is to read from CF A and write the output of my M/R job to CF B. 

That said, I've read this from Spark's FAQ (http://spark.apache.org/faq.html):

"Do I need Hadoop to run Spark?
No, but if you run on a cluster, you will need some form of shared file system 
(for example, NFS mounted at the same path on each node). If you have this type 
of filesystem, you can just deploy Spark in standalone mode."

The question I ask is - if I don't want to have an HDFS installation just to run 
Spark on Cassandra, is my only option to have this NFS mounted over the network? 
It doesn't seem smart to me to have something like NFS to store Spark files, as 
it would probably affect performance, and at the same time I wouldn't like to 
have an additional HDFS cluster just to run jobs on Cassandra. 
Is there a way of using Cassandra itself as this "some form of shared file 
system"?

-Marcelo


<< ideas dont deserve respect >>

Re: ScyllaDB, a new open source, Cassandra-compatible NoSQL

2015-09-23 Thread Marcelo Valle (BLOOMBERG/ LONDON)
I think there is a very important point in ScyllaDB - latency. 
Performance can be an important requirement, but the fact that ScyllaDB is written 
in C++ and uses lock-free algorithms internally means it should have lower latency 
than Cassandra, which enables its use for a wider range of applications. 
It seems like a huge milestone achieved by the Cassandra community, congratulations!

From: user@cassandra.apache.org 
Subject: Re: ScyllaDB, a new open source, Cassandra-compatible NoSQL

Looking at the architecture and what ScyllaDB does, I'm not surprised they got a 
10x improvement. Seastar skips a lot of the overhead of copying stuff and it 
gives them CPU core affinity. Anyone who has listened to Cliff Click talk about 
cache misses, locks and other low-level stuff would recognize the huge boost in 
performance when many of those bottlenecks are removed. Using an actor model to 
avoid locks doesn't hurt either.

On Tue, Sep 22, 2015 at 5:20 PM, Minh Do  wrote:

First glance at their GitHub, it looks like they re-implemented Cassandra in 
C++. 90% of Cassandra's components are in ScyllaDB, i.e. compaction, repair, CQL, 
gossip, SSTables.


With C++, I believe this helps performance to some extent, up to the point where 
compaction has not run yet. 
After that, disk I/O becomes the dominant factor in the performance 
measurement: the more traffic a node receives, the more performance 
degrades across the cluster.

Also, they only support the Thrift protocol, so it won't work with the Java driver 
and the new asynchronous protocol. I doubt their tests 
are truly fair.

On Tue, Sep 22, 2015 at 2:13 PM, Venkatesh Arivazhagan  
wrote:

I came across this article: 
zdnet.com/article/kvm-creators-open-source-fast-cassandra-drop-in-replacement-scylla/

Tzach, I would love to know/understand more about ScyllaDB too. Also, the 
benchmark seems to have only 1 DB server. Do you have benchmark numbers where 
more than one DB server was involved? :)


On Tue, Sep 22, 2015 at 1:40 PM, Sachin Nikam  wrote:

Tzach,
Can you point to any documentation on the ScyllaDB site which talks about how/why 
ScyllaDB performs better than Cassandra while using the same architecture?
Regards
Sachin

On Tue, Sep 22, 2015 at 9:18 AM, Tzach Livyatan  
wrote:

Hello Cassandra users,

We are pleased to announce a new member of the Cassandra Ecosystem - ScyllaDB
ScyllaDB is a new, open source, Cassandra-compatible NoSQL data store, written 
with the goal of delivering superior performance and consistent low latency.  
Today, ScyllaDB runs 1M tps per server with sub 1ms latency.

ScyllaDB  supports CQL, is compatible with Cassandra drivers, and works out of 
the box with Cassandra tools like cqlsh, Spark connector, nodetool and 
cassandra-stress. ScyllaDB is a drop-in replacement solution for the Cassandra 
server side packages.

Scylla is implemented using the new shared-nothing Seastar framework for 
extreme performance on modern multicore hardware, and the Data Plane 
Development Kit (DPDK) for high-speed low-latency networking.

Try Scylla Now - http://www.scylladb.com

We will be at Cassandra Summit 2015; you are welcome to visit our booth to hear 
more and see a demo.
Avi Kivity, our CTO, will host a session on Scylla on Thursday, 1:50 PM - 2:30 
PM in rooms M1 - M3.

Regards
Tzach
scylladb


<< ideas dont deserve respect >>

Re: how many rows can one partion key hold?

2015-02-27 Thread Marcelo Valle (BLOOMBERG/ LONDON)
 When one partition's data is extreme large, the write/read will slow?

This is actually a good question. If a partition has nearly 2 billion rows, will 
writes or reads get too slow? My understanding is that they shouldn't, as data is 
indexed inside a partition and when you read or write you are doing a binary 
search, so the operation should take log(n) time. 

However, my practical experience tells me it can be a problem depending on the 
number of reads you do and how you do them. If your binary search takes 2 more 
steps, that is nothing for a single read, but over 1 billion reads it could be 
considerably slow. Also, this search could hit disk, as it depends a lot on how 
your cache is configured.

Having a small amount of data per partition could be a Cassandra anti-pattern too, 
mainly if your reads go across many partitions. 

I think there is no correct answer here; it depends on your data and on your 
application, IMHO.
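As a rough sketch of the read pattern being discussed (table and column names are 
hypothetical, DataStax Java driver used from Scala), a read inside one wide 
partition names the partition key plus a clustering range, so it is resolved 
through the partition's index rather than by scanning every row:

import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

object WidePartitionSlice {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("my_keyspace")

    // Hypothetical wide-partition table:
    //   CREATE TABLE events (user_id text, event_time timeuuid, payload text,
    //                        PRIMARY KEY (user_id, event_time));
    // Only the requested slice of the single partition is read.
    val rs = session.execute(
      "SELECT event_time, payload FROM events " +
      "WHERE user_id = ? AND event_time > minTimeuuid('2015-02-01') LIMIT 100",
      "user-42")

    rs.asScala.foreach(row => println(row.getString("payload")))
    cluster.close()
  }
}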

-Marcelo


From: user@cassandra.apache.org 
Subject: Re: how many rows can one partion key hold?

you might want to read here 
http://wiki.apache.org/cassandra/CassandraLimitations

jason

On Fri, Feb 27, 2015 at 2:44 PM, wateray wate...@163.com wrote:

Hi all,
My team is using Cassandra as our database. We have one question, as below.
As we know, rows with the same partition key will be stored on the same node.
But how many rows can one partition key hold? What does it depend on? The node's 
volume, the partition's data size, or the number of rows in the partition?
When one partition's data is extremely large, will writes/reads slow down?
Can anyone show me some existing use cases?
Thanks!

 




Re: Unexplained query slowness

2015-02-26 Thread Marcelo Valle (BLOOMBERG/ LONDON)
I didn't know about this cfhistograms thing, very nice!

From: user@cassandra.apache.org 
Subject: Re: Unexplained query slowness

Have a look at your column family histograms (nodetool cfhistograms, iirc). If 
you notice things like a very long tail, a double hump or outliers, it would 
indicate something wrong with your data model or that you have a hot partition 
key (or keys).

Also, looking only at your 99th and 95th percentile latencies will hide these 
occasional high-latency reads, as they fall outside those percentiles.

If you are running a stock config, first rule out your data model, 
then investigate things like disk latency and noisy neighbours (if you are on 
VMs / in the cloud).

On 26 February 2015 at 03:01, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

I am sorry if it's too basic and you already looked at that, but the first 
thing I would ask about would be the data model.

What data model are you using (how is your data partitioned)? What queries are 
you running? If you are using ALLOW FILTERING, for instance, it will be very 
easy to say why it's slow. 

Most of the time, when people get slow queries in Cassandra, they are using the 
wrong data model.

[]s

From: user@cassandra.apache.org 
Subject: Re:Unexplained query slowness

Our Cassandra database just rolled to live last night. I’m looking at our query 
performance, and overall it is very good, but perhaps 1 in 10,000 queries takes 
several hundred milliseconds (up to a full second). I’ve grepped for GC in the 
system.log on all nodes, and there aren’t any recent GC events. I’m executing 
~500 queries per second, which produces negligible load and CPU utilization. I 
have very minimal writes (one every few minutes). The slow queries are across 
the board. There isn’t one particular query that is slow.

I’m running 2.0.12 with SSD’s. I’ve got a 10 node cluster with RF=3.

I have no idea where to even begin to look. Any thoughts on where to start 
would be greatly appreciated.

Robert


-- 

Ben Bromhead
Instaclustr | www.instaclustr.com | @instaclustr | (650) 284 9692




Re:Unexplained query slowness

2015-02-25 Thread Marcelo Valle (BLOOMBERG/ LONDON)
I am sorry if it's too basic and you already looked at that, but the first 
thing I would ask about would be the data model.

What data model are you using (how is your data partitioned)? What queries are 
you running? If you are using ALLOW FILTERING, for instance, it will be very 
easy to say why it's slow. 

Most of the time, when people get slow queries in Cassandra, they are using the 
wrong data model.
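To make the ALLOW FILTERING point concrete, here is a sketch (hypothetical 
keyspace, table and columns, DataStax Java driver from Scala) contrasting a 
partition-key query with a filtering scan:

import com.datastax.driver.core.Cluster

object QueryShapes {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("my_keyspace")

    // Hypothetical table:
    //   CREATE TABLE orders (customer_id text, order_time timestamp, total decimal,
    //                        PRIMARY KEY (customer_id, order_time));

    // Fast path: the partition key is restricted, so the coordinator goes straight
    // to the replicas owning that single partition.
    session.execute("SELECT * FROM orders WHERE customer_id = ?", "cust-123")

    // Slow path: no partition key, only a clustering-column restriction, so Cassandra
    // has to scan and filter across partitions and only accepts this with ALLOW FILTERING.
    // Needing this clause is usually a sign the table should be remodelled around the query.
    session.execute("SELECT * FROM orders WHERE order_time > '2015-01-01' ALLOW FILTERING")

    cluster.close()
  }
}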

[]s

From: user@cassandra.apache.org 
Subject: Re:Unexplained query slowness

Our Cassandra database just rolled to live last night. I’m looking at our query 
performance, and overall it is very good, but perhaps 1 in 10,000 queries takes 
several hundred milliseconds (up to a full second). I’ve grepped for GC in the 
system.log on all nodes, and there aren’t any recent GC events. I’m executing 
~500 queries per second, which produces negligible load and CPU utilization. I 
have very minimal writes (one every few minutes). The slow queries are across 
the board. There isn’t one particular query that is slow.

I’m running 2.0.12 with SSD’s. I’ve got a 10 node cluster with RF=3.

I have no idea where to even begin to look. Any thoughts on where to start 
would be greatly appreciated.

Robert




Re:Cassandra Read Timeout

2015-02-24 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Yulian,

Maybe other people have other clues, but I think if you could monitor the 
behavior in tpstats after the activity "Seeking to partition beginning in data 
file", it could help to find the problem. Which type of thread is getting stuck? 
Do you see any number increasing continuously during the request?

Best regards,
Marcelo.

From: user@cassandra.apache.org 
Subject: Re:Cassandra Read Timeout

Hello to all,
I have a single-node Cassandra on Amazon EC2.
Currently I am having a read timeout problem on a single CF, a single row.

The row size is around 190MB. There are bigger rows with a similar structure (the 
index rows, which actually store keys) and everything is working fine on 
them; everything is also working fine on this CF except on this one row.

Table data from cfstats (the first table has bigger rows but works fine, whereas 
the second one has the timeout):
Column Family: pendindexes
SSTable count: 5
Space used (live): 462298352
Space used (total): 462306752
SSTable Compression Ratio: 0.3511107495795905
Number of Keys (estimate): 640
Memtable Columns Count: 63339
Memtable Data Size: 12328802
Memtable Switch Count: 78
Read Count: 10
Read Latency: NaN ms.
Write Count: 1530113
Write Latency: 0.022 ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.0
Bloom Filter Space Used: 3920
Compacted row minimum size: 73
Compacted row maximum size: 223875792
Compacted row mean size: 42694982
Average live cells per slice (last five minutes): 21.0
Average tombstones per slice (last five minutes): 0.0

Column Family: statuspindexes
SSTable count: 1
Space used (live): 99602136
Space used (total): 99609360
SSTable Compression Ratio: 0.34278775390997873
Number of Keys (estimate): 128
Memtable Columns Count: 6250
Memtable Data Size: 6061097
Memtable Switch Count: 65
Read Count: 1000
Read Latency: NaN ms.
Write Count: 1193142
Write Latency: 3.616 ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.0
Bloom Filter Space Used: 656
Compacted row minimum size: 180
Compacted row maximum size: 186563160
Compacted row mean size: 63225562
Average live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0

I have tried to debug it with CQL tracing; that's what I get:

 activity                                                                                         | timestamp    | source       | source_elapsed
---------------------------------------------------------------------------------------------------+--------------+--------------+----------------
                                                                                  execute_cql3_query | 15:39:53,120 | 172.31.6.173 |              0
                                                       Parsing Select * from statuspindexes LIMIT 1; | 15:39:53,120 | 172.31.6.173 |            875
                                                                                 Preparing statement | 15:39:53,121 | 172.31.6.173 |           1643
                                                                       Determining replicas to query | 15:39:53,121 | 172.31.6.173 |           1740
 Executing seq scan across 1 sstables for [min(-9223372036854775808), min(-9223372036854775808)]     | 15:39:53,122 | 172.31.6.173 |           2581
                                                         Seeking to partition beginning in data file | 15:39:53,123 | 172.31.6.173 |           3118
                                               Timed out; received 0 of 1 responses for range 2 of 2 | 15:40:03,121 | 172.31.6.173 |       10001370
                                                                                    Request complete | 15:40:03,121 | 172.31.6.173 |       10001513

I have executed compaction on that cf.
What could lead to that problem?
Best regards
Yulian Oifa




Re: Cassandra Read Timeout

2015-02-24 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Indeed, I thought something odd could be happening to your cluster, but it 
seems it's working fine; the request is just taking too long to complete. 

I noticed from your cfstats that the read count was about 10 on the first CF and 
about 1000 on the second one... Would you be doing many more reads on the 
second one? If the load is higher, it could justify the timeout.

Do both CFs have the same data model? Are you running exactly the same queries? 

Best regards,
Marcelo.

From: user@cassandra.apache.org 
Subject: Re: Cassandra Read Timeout

Hello
TP STATS Before Request:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0        7592835         0                 0
RequestResponseStage              0         0              0         0                 0
MutationStage                     0         0      215980736         0                 0
ReadRepairStage                   0         0              0         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0              0         0                 0
AntiEntropyStage                  0         0              0         0                 0
MigrationStage                    0         0             28         0                 0
MemoryMeter                       0         0            474         0                 0
MemtablePostFlusher               0         0          32845         0                 0
FlushWriter                       0         0           4013         0              2239
MiscStage                         0         0              0         0                 0
PendingRangeCalculator            0         0              1         0                 0
commitlog_archiver                0         0              0         0                 0
InternalResponseStage             0         0              0         0                 0
HintedHandoff                     0         0              0         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  0
BINARY                       0
READ                         0
MUTATION                     0
_TRACE                       0
REQUEST_RESPONSE             0
COUNTER_MUTATION             0

TP STATS After Request:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         1         1        7592942         0                 0
RequestResponseStage              0         0              0         0                 0
MutationStage                     0         0      215983339         0                 0
ReadRepairStage                   0         0              0         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0              0         0                 0
AntiEntropyStage                  0         0              0         0                 0
MigrationStage                    0         0             28         0                 0
MemoryMeter                       0         0            474         0                 0
MemtablePostFlusher               0         0          32845         0                 0
FlushWriter                       0         0           4013         0              2239
MiscStage                         0         0              0         0                 0
PendingRangeCalculator            0         0              1         0                 0
commitlog_archiver                0         0              0         0                 0
InternalResponseStage             0         0              0         0                 0
HintedHandoff                     0         0              0         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  0
BINARY                       0
READ                         0
MUTATION                     0
_TRACE                       0
REQUEST_RESPONSE             0
COUNTER_MUTATION             0

The only items that changed are ReadStage (completed increased by 107, plus 1 
active/pending) and MutationStage (completed increased by 2603).
Please note that the system is writing all the time in batches (each second, 2 
servers write one batch each), so I don't see anything special in these numbers.

Best regards
Yulian Oifa




Re: Cassandra Read Timeout

2015-02-24 Thread Marcelo Valle (BLOOMBERG/ LONDON)
I am sorry, I'm not sure I will be able to help you. I am not familiar with 
super columns; I would tell you to try to get rid of them as soon as possible. 
Maybe someone else on the list can help you.
Anyway, it really seems you have a cell with a very large amount of data, and 
the request might be taking longer to complete because of that amount, but then I 
also don't understand why requests on the same row with a different SC, and on 
the same SC in a different row, would work. 

[]s

From: oifa.yul...@gmail.com 
Subject: Re: Cassandra Read Timeout

Hello
I am running 1.2.19
Best regards
Yulian Oifa

On Tue, Feb 24, 2015 at 6:57 PM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

Super column? Out of curiosity, which Cassandra version are you running?


From: user@cassandra.apache.org 
Subject: Re: Cassandra Read Timeout

Hello
The structure is the same; the CFs are super column CFs, where the key is a long 
(a timestamp used to partition the index, so every 11 days a new row is created), 
the super column is an int32, and the columns/values are timeuuids. I am running 
the same queries, getting a reversed slice by row key and super column.
The number of reads is relatively high on the second CF since I have been testing 
it for several hours already; most of the time there are no read requests on 
either of them, only writes. There is at most 1 read request every 20-30 seconds, 
so it should not create a load. There are also no reads (0 before and 1 after) 
pending in tpstats.
Please also note that queries on a different row with the same super column, and 
the same row with a different super column, are working, and if I am not mistaken 
Cassandra loads the complete row including all super columns into memory (so 
either every request to this row should fail if this were a memory problem, or 
none...).

Best regards 
Yulian Oifa




Re: Cassandra Read Timeout

2015-02-24 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Super column? Out of curiosity, which Cassandra version are you running?


From: user@cassandra.apache.org 
Subject: Re: Cassandra Read Timeout

Hello
The structure is the same; the CFs are super column CFs, where the key is a long 
(a timestamp used to partition the index, so every 11 days a new row is created), 
the super column is an int32, and the columns/values are timeuuids. I am running 
the same queries, getting a reversed slice by row key and super column.
The number of reads is relatively high on the second CF since I have been testing 
it for several hours already; most of the time there are no read requests on 
either of them, only writes. There is at most 1 read request every 20-30 seconds, 
so it should not create a load. There are also no reads (0 before and 1 after) 
pending in tpstats.
Please also note that queries on a different row with the same super column, and 
the same row with a different super column, are working, and if I am not mistaken 
Cassandra loads the complete row including all super columns into memory (so 
either every request to this row should fail if this were a memory problem, or 
none...).

Best regards 
Yulian Oifa




Re:designing table

2015-02-20 Thread Marcelo Valle (BLOOMBERG/ LONDON)
My two cents:
You could partition your data per date, and the second query would be easy. 
If you need to query ALL data for a client id it would be hard, but querying the 
last 10 days for a client id could be easy, for instance. 
If you need to query ALL, it would probably be better to create another CF and 
write to both; google for "Cassandra materialized view" in this case.
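A rough sketch of that dual-write ("materialized view by hand") idea, with made-up 
table and column names, using the DataStax Java driver from Scala:

import com.datastax.driver.core.Cluster

object DualWriteSketch {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("my_keyspace")

    // One table per query pattern (hypothetical schemas):
    //   CREATE TABLE json_by_client (client_id text, inserted_at timeuuid, json text,
    //                                PRIMARY KEY (client_id, inserted_at));
    //   CREATE TABLE json_by_date   (day text, client_id text, inserted_at timeuuid, json text,
    //                                PRIMARY KEY (day, client_id, inserted_at));
    val insertByClient = session.prepare(
      "INSERT INTO json_by_client (client_id, inserted_at, json) VALUES (?, now(), ?)")
    val insertByDate = session.prepare(
      "INSERT INTO json_by_date (day, client_id, inserted_at, json) VALUES (?, ?, now(), ?)")

    // Write the same payload to both tables so each query pattern reads a single partition.
    def save(clientId: String, day: String, json: String): Unit = {
      session.execute(insertByClient.bind(clientId, json))
      session.execute(insertByDate.bind(day, clientId, json))
    }

    save("abc123", "2015-02-20", """{"example": true}""")
    cluster.close()
  }
}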
[]s

From: user@cassandra.apache.org 
Subject: Re:designing table

I am trying to design a table in Cassandra in which I will have multiple JSON 
strings for a particular client id.

  abc123      -  jsonA
  abc123      -  jsonB
  abcd12345   -  jsonC

My query pattern is going to be:

Give me all JSON strings for a particular client id.
Give me all the client ids and JSON strings for a particular date.

What is the best way to design table for this?



Re:PySpark and Cassandra integration

2015-02-20 Thread Marcelo Valle (BLOOMBERG/ LONDON)
I will try it for sure Frens, very nice!
Thanks for sharing!

From: user@cassandra.apache.org 
Subject: Re:PySpark and Cassandra integration


Hi all,
Wanted to let you know I've forked PySpark Cassandra at 
https://github.com/TargetHolding/pyspark-cassandra. Unfortunately the original 
code didn't work for me and I couldn't figure out how it could work. But it 
inspired me! So I rewrote the majority of the project.
The rewrite implements full usage of 
https://github.com/datastax/spark-cassandra-connector and brings much of its 
goodness to PySpark!
Hope that some of you are able to put this to good use. Feedback, pull 
requests, etc. are more than welcome!
Best regards,
Frens Jan



Re:query by column size

2015-02-13 Thread Marcelo Valle (BLOOMBERG/ LONDON)
There is no automatic indexing in Cassandra. There are secondary indexes, but 
not for these cases.
You could use a solution like DSE to get data automatically indexed in Solr, 
on each node, as soon as data comes in. Then you could run such a query on Solr.
If the query can be slow, you could run an MR job over all rows, filtering the 
ones you want.
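A sketch of that full-scan-and-filter approach using Spark and the 
spark-cassandra-connector (keyspace, table and column names are made up, and the 
connector API may differ by version):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object LargeJsonScan {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("large-json-scan")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Full scan of the table, keeping only rows whose xml/json column is over 50 KB.
    val largeRows = sc.cassandraTable("my_keyspace", "my_table")
      .filter(row => row.getStringOption("payload").exists(_.length > 50 * 1024))
      .map(row => row.getString("id"))

    largeRows.collect().foreach(println)
    sc.stop()
  }
}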
[]s
From: user@cassandra.apache.org 
Subject: Re:query by column size

Greetings,

I have one column family with 10 columns; in one of the columns we store XML/JSON.
Is there a way I can query that column where size > 50kb, assuming I have an 
index on that column?

thanks
CV.



Re: best supported spark connector for Cassandra

2015-02-13 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Actually, I am not the one looking for support, but I thank you a lot anyway.
But from your message I guess the answer is yes, Datastax is not the only 
Cassandra vendor offering support and changing official Cassandra source at 
this moment, is this right?
From: user@cassandra.apache.org 
Subject: Re: best supported spark connector for Cassandra

Of course, Stratio Deep and Stratio Cassandra are licensed  Apache 2.0.   

Regarding the Cassandra support, I can introduce you to someone in Stratio that 
can help you. 

2015-02-12 15:05 GMT+01:00 Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net:

Thanks for the hint, Gaspar. 
Do you know if Stratio Deep / Stratio Cassandra are also licensed Apache 2.0?

I had an interest in knowing more about Stratio when I was working at a start-up. 
Now, at a blue chip, it seems one of the hardest obstacles to using Cassandra in a 
project is the need for a team supporting it, and it seems people are especially 
concerned about how many vendors an open source solution has to provide 
support. 

This seems to be kind of an advantage of HBase, as there are many vendors 
supporting it, but I wonder if Stratio can be considered an alternative to 
Datastax regarding Cassandra support?

It's not my call here to decide anything, but as part of the community it helps 
to have this business scenario clear. I could say Cassandra could be the best-fit 
technical solution for some projects, but sometimes non-technical factors 
are in the game, like this need for having more than one vendor available...


From: gmu...@stratio.com 
Subject: Re: best supported spark connector for Cassandra

My suggestion is to use Java or Scala instead of Python. For Java/Scala both 
the Datastax and Stratio drivers are valid and similar options. As far as I 
know they both take care of data locality and are not based on the Hadoop 
interface. The advantage of Stratio Deep is that it allows you to integrate Spark 
not only with Cassandra but with MongoDB, Elasticsearch, Aerospike and others 
as well. 
Stratio has a forked Cassandra for including some additional features such as 
Lucene based secondary indexes. So Stratio driver works fine with the Apache 
Cassandra and also with their fork.

You can find some examples of using Deep here: 
https://github.com/Stratio/deep-examples  Please if you need some help with 
Stratio Deep do not hesitate to contact us.


2015-02-11 17:18 GMT+01:00 shahab shahab.mok...@gmail.com:

I am using the Calliope cassandra-spark 
connector (http://tuplejump.github.io/calliope/), which is quite handy and easy 
to use!
The only problem is that it is a bit outdated; it works with Spark 1.1.0. 
Hopefully a new version comes soon.

best,
/Shahab

On Wed, Feb 11, 2015 at 2:51 PM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

I just finished a scala course, nice exercise to check what I learned :D

Thanks for the answer!

From: user@cassandra.apache.org 
Subject: Re: best supported spark connector for Cassandra

Start looking at the Spark/Cassandra connector here (in Scala): 
https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector

Data locality is provided by this method: 
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336

Start digging from this all the way down the code.

As for Stratio Deep, I can't tell how they did the integration with Spark. Take 
some time to dig into their code to understand the logic. 


On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

Taking the opportunity that Spark was being discussed in another thread, I decided 
to start a new one, as I have an interest in using Spark + Cassandra in the future.

About 3 years ago, Spark was not an existing option and we tried to use hadoop 
to process Cassandra data. My experience was horrible and we reached the 
conclusion it was faster to develop an internal tool than insist on Hadoop _for 
our specific case_. 

From what I can see, Spark is starting to be known as a "better Hadoop" and it 
seems the market is going this way now. I can also see I have many more options 
to decide how to integrate Cassandra using the Spark RDD concept than using the 
ColumnFamilyInputFormat. 

I have found this java driver made by Datastax: 
https://github.com/datastax/spark-cassandra-connector

I also have found python Cassandra support on spark's repo, but it seems 
experimental yet: 
https://github.com/apache/spark/tree/master/examples/src/main/python

Finally I have found stratio deep: https://github.com/Stratio/deep-spark
It seems Stratio guys have forked Cassandra also, I am still a little confused 
about it.

Question: which driver should I use, if I want to use Java? And which if I want 
to use python? 
I think the way Spark can integrate to Cassandra makes all the difference

Re: best supported spark connector for Cassandra

2015-02-13 Thread Marcelo Valle (BLOOMBERG/ LONDON)
For SQL queries on Cassandra I used to use Presto: https://prestodb.io/

It's a nice tool from FB and seems to work well with Cassandra. You can use 
their JDBC driver with your favourite java SQL tool. 

Inside my apps, I never needed to use SQL queries.

[]s
From: pavel.velik...@gmail.com 
Subject: Re: best supported spark connector for Cassandra

Hi Marcelo,

  Were you able to use the Spark SQL features of the Cassandra connector? I 
couldn’t make a .jar that wouldn’t conflict with the Spark SQL native .jar…
So I ended up using only the basic features and cannot use SQL queries.


On Feb 13, 2015, at 7:49 PM, Paulo Ricardo Motta Gomes 
paulo.mo...@chaordicsystems.com wrote:
I used to use calliope, which was really awesome before DataStax native 
integration with Spark. Now I'm quite happy with the official DataStax spark 
connector, it's very straightforward to use.

I never tried to use these drivers with Java though; I'd suggest you use 
them with Scala, which is the best option to write Spark jobs.

On Fri, Feb 13, 2015 at 12:12 PM, Carlos Rolo r...@pythian.com wrote:

Not for sure ;)

If you need Cassandra support I can forward you to someone to talk to at 
Pythian.

Regards,

Carlos Juzarte Rolo
Cassandra Consultant
 
Pythian - Love your data

rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
Tel: 1649
www.pythian.com

On Fri, Feb 13, 2015 at 3:05 PM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

Actually, I am not the one looking for support, but I thank you a lot anyway.
But from your message I guess the answer is yes, Datastax is not the only 
Cassandra vendor offering support and changing official Cassandra source at 
this moment, is this right?

From: user@cassandra.apache.org 
Subject: Re: best supported spark connector for Cassandra

Of course, Stratio Deep and Stratio Cassandra are licensed  Apache 2.0.   

Regarding the Cassandra support, I can introduce you to someone in Stratio that 
can help you. 

2015-02-12 15:05 GMT+01:00 Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net:

Thanks for the hint, Gaspar. 
Do you know if Stratio Deep / Stratio Cassandra are also licensed Apache 2.0?

I had an interest in knowing more about Stratio when I was working at a start-up. 
Now, at a blue chip, it seems one of the hardest obstacles to using Cassandra in a 
project is the need for a team supporting it, and it seems people are especially 
concerned about how many vendors an open source solution has to provide 
support. 

This seems to be kind of an advantage of HBase, as there are many vendors 
supporting it, but I wonder if Stratio can be considered an alternative to 
Datastax regarding Cassandra support?

It's not my call here to decide anything, but as part of the community it helps 
to have this business scenario clear. I could say Cassandra could be the best-fit 
technical solution for some projects, but sometimes non-technical factors 
are in the game, like this need for having more than one vendor available...


From: gmu...@stratio.com 
Subject: Re: best supported spark connector for Cassandra

My suggestion is to use Java or Scala instead of Python. For Java/Scala both 
the Datastax and Stratio drivers are valid and similar options. As far as I 
know they both take care of data locality and are not based on the Hadoop 
interface. The advantage of Stratio Deep is that it allows you to integrate Spark 
not only with Cassandra but with MongoDB, Elasticsearch, Aerospike and others 
as well. 
Stratio has a forked Cassandra for including some additional features such as 
Lucene based secondary indexes. So Stratio driver works fine with the Apache 
Cassandra and also with their fork.

You can find some examples of using Deep here: 
https://github.com/Stratio/deep-examples  Please if you need some help with 
Stratio Deep do not hesitate to contact us.


2015-02-11 17:18 GMT+01:00 shahab shahab.mok...@gmail.com:

I am using the Calliope cassandra-spark 
connector (http://tuplejump.github.io/calliope/), which is quite handy and easy 
to use!
The only problem is that it is a bit outdated; it works with Spark 1.1.0. 
Hopefully a new version comes soon.

best,
/Shahab

On Wed, Feb 11, 2015 at 2:51 PM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

I just finished a scala course, nice exercise to check what I learned :D

Thanks for the answer!

From: user@cassandra.apache.org 
Subject: Re: best supported spark connector for Cassandra

Start looking at the Spark/Cassandra connector here (in Scala): 
https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector

Data locality is provided by this method: 
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336

Start digging from this all the way down the code.

As for Stratio Deep, I can't tell

Re: best supported spark connector for Cassandra

2015-02-12 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Thanks for the hint, Gaspar. 
Do you know if Stratio Deep / Stratio Cassandra are also licensed Apache 2.0?

I had an interest in knowing more about Stratio when I was working at a start-up. 
Now, at a blue chip, it seems one of the hardest obstacles to using Cassandra in a 
project is the need for a team supporting it, and it seems people are especially 
concerned about how many vendors an open source solution has to provide 
support. 

This seems to be kind of an advantage of HBase, as there are many vendors 
supporting it, but I wonder if Stratio can be considered an alternative to 
Datastax regarding Cassandra support?

It's not my call here to decide anything, but as part of the community it helps 
to have this business scenario clear. I could say Cassandra could be the best-fit 
technical solution for some projects, but sometimes non-technical factors 
are in the game, like this need for having more than one vendor available...


From: gmu...@stratio.com 
Subject: Re: best supported spark connector for Cassandra

My suggestion is to use Java or Scala instead of Python. For Java/Scala both 
the Datastax and Stratio drivers are valid and similar options. As far as I 
know they both take care of data locality and are not based on the Hadoop 
interface. The advantage of Stratio Deep is that it allows you to integrate Spark 
not only with Cassandra but with MongoDB, Elasticsearch, Aerospike and others 
as well. 
Stratio has a forked Cassandra for including some additional features such as 
Lucene based secondary indexes. So Stratio driver works fine with the Apache 
Cassandra and also with their fork.

You can find some examples of using Deep here: 
https://github.com/Stratio/deep-examples  Please if you need some help with 
Stratio Deep do not hesitate to contact us.


2015-02-11 17:18 GMT+01:00 shahab shahab.mok...@gmail.com:

I am using the Calliope cassandra-spark 
connector (http://tuplejump.github.io/calliope/), which is quite handy and easy 
to use!
The only problem is that it is a bit outdated; it works with Spark 1.1.0. 
Hopefully a new version comes soon.

best,
/Shahab

On Wed, Feb 11, 2015 at 2:51 PM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

I just finished a scala course, nice exercise to check what I learned :D

Thanks for the answer!

From: user@cassandra.apache.org 
Subject: Re: best supported spark connector for Cassandra

Start looking at the Spark/Cassandra connector here (in Scala): 
https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector

Data locality is provided by this method: 
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336

Start digging from this all the way down the code.

As for Stratio Deep, I can't tell how they did the integration with Spark. Take 
some time to dig into their code to understand the logic. 


On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

Taking the opportunity that Spark was being discussed in another thread, I decided 
to start a new one, as I have an interest in using Spark + Cassandra in the future.

About 3 years ago, Spark was not an existing option and we tried to use hadoop 
to process Cassandra data. My experience was horrible and we reached the 
conclusion it was faster to develop an internal tool than insist on Hadoop _for 
our specific case_. 

From what I can see, Spark is starting to be known as a "better Hadoop" and it 
seems the market is going this way now. I can also see I have many more options 
to decide how to integrate Cassandra using the Spark RDD concept than using the 
ColumnFamilyInputFormat. 

I have found this java driver made by Datastax: 
https://github.com/datastax/spark-cassandra-connector

I also have found python Cassandra support on spark's repo, but it seems 
experimental yet: 
https://github.com/apache/spark/tree/master/examples/src/main/python

Finally I have found stratio deep: https://github.com/Stratio/deep-spark
It seems Stratio guys have forked Cassandra also, I am still a little confused 
about it.

Question: which driver should I use, if I want to use Java? And which if I want 
to use python? 
I think the way Spark integrates with Cassandra makes all the difference in 
the world, from my past experience, so I would like to know more about it, but 
I don't even know which source code I should start looking at...
I would like to integrate using Python and/or C++, but I wonder if it wouldn't 
pay off to just use the Java driver instead.

Thanks in advance


-- 

Gaspar Muñoz 
@gmunozsoria

Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // @stratiobd



Re: How to speed up SELECT * query in Cassandra

2015-02-12 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Thanks Jirka!

From: user@cassandra.apache.org 
Subject: Re: How to speed up SELECT * query in Cassandra

  Hi,

here are some snippets of code in scala which should get you started.

Jirka H.

loop { lastRow =>
  val query = lastRow match {
    case Some(row) => nextPageQuery(row, upperLimit)
    case None      => initialQuery(lowerLimit)
  }
  session.execute(query).all
}

private def nextPageQuery(row: Row, upperLimit: String): String = {
  val tokenPart = "token(%s) > token(0x%s) and token(%s) %s"
    .format(rowKeyName, hex(row.getBytes(rowKeyName)), rowKeyName, upperLimit)
  basicQuery.format(tokenPart)
}

private def initialQuery(lowerLimit: String): String = {
  val tokenPart = "token(%s) >= %s".format(rowKeyName, lowerLimit)
  basicQuery.format(tokenPart)
}

private def calculateRanges: (BigDecimal, BigDecimal, IndexedSeq[(BigDecimal, BigDecimal)]) = {
  tokenRange match {
    case Some((start, end)) =>
      Logger.info("Token range given: <" + start.underlying.toPlainString + ", " +
        end.underlying.toPlainString + ">")
      val tokenSpaceSize = end - start
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency)
        yield (start + (i * rangeSize), start + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
    case None =>
      val tokenSpaceSize = partitioner.max - partitioner.min
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency)
        yield (partitioner.min + (i * rangeSize), partitioner.min + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
  }
}

private val basicQuery = {
  "select %s, %s, %s, writetime(%s) from %s where %s%s limit %d%s".format(
    rowKeyName, columnKeyName, columnValueName, columnValueName, columnFamily,
    "%s", // template for the token condition
    whereCondition, pageSize,
    if (cqlAllowFiltering) " allow filtering" else "")
}

case object Murmur3 extends Partitioner {
  override val min = BigDecimal(-2).pow(63)
  override val max = BigDecimal(2).pow(63) - 1
}

case object Random extends Partitioner {
  override val min = BigDecimal(0)
  override val max = BigDecimal(2).pow(127) - 1
}
 
On 02/11/2015 02:21 PM, Ja Sam wrote:

Your answer looks very promising.

How do you calculate start and stop?


On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky ho...@avast.com wrote:

The fastest way I am aware of is to do the queries in parallel to
multiple cassandra nodes and make sure that you only ask them for keys
they are responsible for. Otherwise, the node needs to resend your query,
which is much slower and creates unnecessary objects (and thus GC pressure).

You can manually take advantage of the token range information, if the
driver does not take this into account for you. Then, you can play with
concurrency and batch size of a single query against one node.
Basically, what you/the driver should do is to transform the query into a series
of SELECT * FROM TABLE WHERE TOKEN IN (start, stop) queries.

I will need to look up the actual code, but the idea should be clear :)

Jirka H.



On 02/11/2015 11:26 AM, Ja Sam wrote:
Is there a simple way (or even a complicated one) how I can speed up a
SELECT * FROM [table] query?
I need to get all rows from one table every day. I split tables, and
create one for each day, but still the query is quite slow (200 million
records).

I was thinking about running this query in parallel, but I don't know if
it is 

Re: best supported spark connector for Cassandra

2015-02-11 Thread Marcelo Valle (BLOOMBERG/ LONDON)
I just finished a scala course, nice exercise to check what I learned :D

Thanks for the answer!

From: user@cassandra.apache.org 
Subject: Re: best supported spark connector for Cassandra

Start looking at the Spark/Cassandra connector here (in Scala): 
https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector

Data locality is provided by this method: 
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336

Start digging from this all the way down the code.

As for Stratio Deep, I can't tell how they did the integration with Spark. Take 
some time to dig into their code to understand the logic. 


On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

Taking the opportunity that Spark was being discussed in another thread, I decided 
to start a new one, as I have an interest in using Spark + Cassandra in the future.

About 3 years ago, Spark was not an existing option and we tried to use hadoop 
to process Cassandra data. My experience was horrible and we reached the 
conclusion it was faster to develop an internal tool than insist on Hadoop _for 
our specific case_. 

From what I can see, Spark is starting to be known as a "better Hadoop" and it 
seems the market is going this way now. I can also see I have many more options 
to decide how to integrate Cassandra using the Spark RDD concept than using the 
ColumnFamilyInputFormat. 

I have found this java driver made by Datastax: 
https://github.com/datastax/spark-cassandra-connector

I also have found python Cassandra support on spark's repo, but it seems 
experimental yet: 
https://github.com/apache/spark/tree/master/examples/src/main/python

Finally I have found stratio deep: https://github.com/Stratio/deep-spark
It seems Stratio guys have forked Cassandra also, I am still a little confused 
about it.

Question: which driver should I use, if I want to use Java? And which if I want 
to use python? 
I think the way Spark integrates with Cassandra makes all the difference in 
the world, from my past experience, so I would like to know more about it, but 
I don't even know which source code I should start looking at...
I would like to integrate using Python and/or C++, but I wonder if it wouldn't 
pay off to just use the Java driver instead.

Thanks in advance




Re:How to speed up SELECT * query in Cassandra

2015-02-11 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Look for the message "Re: Fastest way to map/parallel read all values in a 
table?" in the mailing list; it was recently discussed. You can have several 
parallel processes, each one reading a slice of the data, by splitting the 
min/max murmur3 hash range.
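A minimal sketch of that range-splitting idea (hypothetical table and column 
names, DataStax Java driver from Scala; shown sequentially here, but each slice 
would normally go to its own process or thread):

import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

object ParallelTokenScan {
  // Split the full Murmur3 token space into `slices` contiguous, non-overlapping ranges.
  def ranges(slices: Int): Seq[(BigInt, BigInt)] = {
    val min  = BigInt(-2).pow(63)
    val max  = BigInt(2).pow(63) - 1
    val step = (max - min) / slices
    (0 until slices).map { i =>
      val start = min + step * i
      val end   = if (i == slices - 1) max else start + step - 1
      (start, end) // both ends inclusive
    }
  }

  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("my_keyspace")

    for ((start, end) <- ranges(16)) {
      val rs = session.execute(
        s"SELECT * FROM my_table WHERE token(id) >= $start AND token(id) <= $end")
      rs.asScala.foreach(println) // process the slice
    }
    cluster.close()
  }
}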

In the company I used to work at, we developed a system to run custom Python 
processes on demand to process Cassandra data, among other things, to be able to 
do that. I hope it will be released as open source soon; it seems a lot of people 
keep running into this same problem.

If you use Cassandra Enterprise, you can use Hive, AFAIK. A good idea would be 
running a Hadoop or Spark process over your cluster and doing the processing you 
want, but sometimes I think it might be a bit hard to achieve good results that 
way, mainly because these tools work fine but are "auto magic". It's hard to 
control where intermediate data will be stored, for example.


From: user@cassandra.apache.org 
Subject: Re:How to speed up SELECT * query in Cassandra

Is there a simple way (or even a complicated one) how I can speed up a SELECT * 
FROM [table] query?
I need to get all rows from one table every day. I split tables, and create one 
for each day, but still the query is quite slow (200 million records).

I was thinking about running this query in parallel, but I don't know if it is 
possible.



best supported spark connector for Cassandra

2015-02-11 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Taking the opportunity that Spark was being discussed in another thread, I decided 
to start a new one, as I have an interest in using Spark + Cassandra in the future.

About 3 years ago, Spark was not an existing option and we tried to use hadoop 
to process Cassandra data. My experience was horrible and we reached the 
conclusion it was faster to develop an internal tool than insist on Hadoop _for 
our specific case_. 

From what I can see, Spark is starting to be known as a "better Hadoop" and it 
seems the market is going this way now. I can also see I have many more options 
to decide how to integrate Cassandra using the Spark RDD concept than using the 
ColumnFamilyInputFormat. 

I have found this java driver made by Datastax: 
https://github.com/datastax/spark-cassandra-connector

I also have found python Cassandra support on spark's repo, but it seems 
experimental yet: 
https://github.com/apache/spark/tree/master/examples/src/main/python

Finally I have found stratio deep: https://github.com/Stratio/deep-spark
It seems Stratio guys have forked Cassandra also, I am still a little confused 
about it.

Question: which driver should I use, if I want to use Java? And which if I want 
to use python? 
I think the way Spark integrates with Cassandra makes all the difference in 
the world, from my past experience, so I would like to know more about it, but 
I don't even know which source code I should start looking at...
I would like to integrate using Python and/or C++, but I wonder if it wouldn't 
pay off to just use the Java driver instead.

Thanks in advance




Re:Fastest way to map/parallel read all values in a table?

2015-02-09 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Just for the record, I was doing the exact same thing in an internal 
application at the start-up I used to work for. We had the need to write 
custom code to process all rows of a column family in parallel. Normally we would 
use Spark for the job, but in our case the logic was a little more complicated, 
so we wrote custom code. 

What we did was to run N processes on M machines (N cores on each), each one 
processing tasks. The tasks were created by splitting the range -2^63 to 2^63 - 1 
into N*M*10 tasks. Even if data was not completely evenly distributed along the 
tasks, no machines were idle, as when some task was completed another one was 
taken from the task pool.

It was fast enough for us, but I am interested in knowing if there is a better 
way of doing it.

For your specific case, here is a tool we released as open source which can be 
useful for simpler tests: https://github.com/s1mbi0se/cql_record_processor

Also, I guess you probably know that, but I would consider using Spark for 
doing this.

Best regards,
Marcelo.

From: user@cassandra.apache.org 
Subject: Re:Fastest way to map/parallel read all values in a table?

What’s the fastest way to map/parallel read all values in a table?

Kind of like a mini map only job.

I’m doing this to compute stats across our entire corpus.

What I did to begin with was use token() and then split it into the number of 
splits I needed.

So I just took the total key range space which is -2^63 to 2^63 - 1 and broke 
it into N parts.

Then the queries come back as:

select * from mytable where token(primaryKey) >= x and token(primaryKey) < y

From reading on this list I thought this was the correct way to handle this 
problem.

However, I’m seeing horrible performance doing this.  After about 1% it just 
flat out locks up.

Could it be that I need to randomize the token order so that it’s not 
contiguous?  Maybe it’s all mapping on the first box to begin with.


-- 

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ profile




Re:Adding more nodes causes performance problem

2015-02-09 Thread Marcelo Valle (BLOOMBERG/ LONDON)
AFAIK, since you were using RF 3 in a 3-node cluster, all your nodes had all 
your data. 
When the number of nodes started to grow, this assumption stopped being true.
I think Cassandra will scale linearly from 9 nodes on, but comparing against a 
situation where all your nodes hold all your data is not really fair, as in 
that situation Cassandra will behave as a database with two extra replicas for 
reads.
I may be wrong, but this is my take.
From: user@cassandra.apache.org 
Subject: Re:Adding more nodes causes performance problem

I have a cluster with 3 nodes, the only keyspace is with replication factor of 
3,
the application reads/writes UUID-keyed data. I use CQL (cassandra-python),
most writes are done by execute_async, most read are done with consistency
level of ONE, overall performance in this setup is better than I expected.

Then I test 6-nodes cluster and 9-nodes. The performance (both read and
write) was getting worse and worse. Roughly speaking, 6-nodes is about 2~3
times slower than 3-nodes, and 9-nodes is about 5~6 times slower than
3-nodes. All tests were done with same data set, same test program, same
client machines, for multiple times. I'm running Cassandra 2.1.2 with default
configuration.

What I observed, is that with 6-nodes and 9-nodes, the Cassandra servers
were doing OK with IO, but CPU utilization was about 60%~70% higher than
3-nodes.

I'd like to get suggestions on how to troubleshoot this, as it is totally against
what I read, that Cassandra scales linearly.




Re: to normalize or not to normalize - read penalty vs write penalty

2015-02-04 Thread Marcelo Valle (BLOOMBERG/ LONDON)
I don't want to optimize for reads or writes, I want to optimize for having the 
smallest gap possible between the time I write and the time I read.
[]s

From: user@cassandra.apache.org 
Subject: Re: to normalize or not to normalize - read penalty vs write penalty

Roughly how often do you expect to update alerts?  How often do you expect to 
read the alerts?  I suspect you'll be doing 100x more reads (or more), in which 
case optimizing for reads is definitely the right choice.

On Wed, Feb 4, 2015 at 9:50 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

Hello everyone,

I am thinking about the architecture of my application using Cassandra and I am 
asking myself if I should or shouldn't normalize an entity.

I have users and alerts in my application and for each user, several alerts. 
The first model which came into my mind was creating an alerts CF with 
user-id as part of the partition key. This way, I can have fast writes and my 
reads will be fast too, as I will always read per partition.

However, I received a requirement later that made my life more complicated. 
Alerts can be shared by 1000s of users and alerts can change. I am building a 
real time app and if I change an alert, all users related to it should see the 
change. 

Suppose I want to keep things denormalized - every time an alert changes I would 
need to do a write on 1000s of records. This way my write performance every time 
I change an alert would be affected. 

On the other hand, I could have a CF for users-alerts and another for alert 
details. Then, at read time, I would need to query 1000s of alerts for a given 
user.

In both situations, there is a gap between the time data is written and the 
time it's available to be read. 

I understand not normalizing will make me use more disk space, but once data is 
written once, I will be able to perform as many reads as I want to with no 
penalty in performance. Also, I understand writes are faster than reads in 
Cassandra, so the gap would be smaller in the first solution.

I would be glad in hearing thoughts from the community.

Best regards,
Marcelo Valle.


-- 
Tyler Hobbs
DataStax




Re: to normalize or not to normalize - read penalty vs write penalty

2015-02-04 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Perfect Tyler.

My feeling was leading me to this, but I wasn't being able to put it in words 
as you did. 

Thanks a lot for the message.


From: user@cassandra.apache.org 
Subject: Re: to normalize or not to normalize - read penalty vs write penalty

Okay.  Let's assume with denormalization you have to do 1000 writes (and one 
read per user) and with normalization you have to do 1 write (and maybe 1000 
reads for each user).

If you execute the writes in the most optimal way (batched by partition, if 
applicable, and separate, concurrent requests per partition), I think it's 
reasonable to say you can do 1000 writes in 10 to 20ms.

Doing 1000 reads is going to take longer.  Exactly how long depends on your 
systems (SSDs or not, whether the data is cached, etc).  But this is probably 
going to take at least 2x as long as the writes. 

So, with denormalization, it's 10 to 20ms for all users to see the change (with 
a median somewhere around 5 to 10ms).  With normalization, all users *could* 
see the update almost immediately, because it's only one write.  However, each 
of your users needs to read 1000 partitions, which takes, say 20 to 50ms.  So 
effectively, they won't see the changes for 20 to 50ms, unless they know to 
read the details for that exact alert.
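For what it's worth, a sketch of the "separate, concurrent requests" write path 
(table and column names are hypothetical, DataStax Java driver from Scala; a real 
application would also cap the number of in-flight requests):

import com.datastax.driver.core.Cluster

object FanOutAlertUpdate {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("my_keyspace")

    // Hypothetical denormalized table: one row per (user, alert).
    val update = session.prepare(
      "UPDATE alerts_by_user SET body = ? WHERE user_id = ? AND alert_id = ?")

    def fanOut(alertId: String, newBody: String, userIds: Seq[String]): Unit = {
      // Fire the writes asynchronously instead of waiting for each one in turn,
      // then block until all of them have completed.
      val futures = userIds.map(u => session.executeAsync(update.bind(newBody, u, alertId)))
      futures.foreach(_.getUninterruptibly())
    }

    fanOut("alert-1", "updated text", (1 to 1000).map(i => s"user-$i"))
    cluster.close()
  }
}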

On Wed, Feb 4, 2015 at 11:57 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

I don't want to optimize for reads or writes, I want to optimize for having the 
smallest gap possible between the time I write and the time I read.
[]s

From: user@cassandra.apache.org 
Subject: Re: to normalize or not to normalize - read penalty vs write penalty

Roughly how often do you expect to update alerts?  How often do you expect to 
read the alerts?  I suspect you'll be doing 100x more reads (or more), in which 
case optimizing for reads is definitely the right choice.

On Wed, Feb 4, 2015 at 9:50 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

Hello everyone,

I am thinking about the architecture of my application using Cassandra and I am 
asking myself if I should or shouldn't normalize an entity.

I have users and alerts in my application and for each user, several alerts. 
The first model which came into my mind was creating an alerts CF with 
user-id as part of the partition key. This way, I can have fast writes and my 
reads will be fast too, as I will always read per partition.

However, I received a requirement later that made my life more complicated. 
Alerts can be shared by 1000s of users and alerts can change. I am building a 
real time app and if I change an alert, all users related to it should see the 
change. 

Suppose I want to keep things denormalized - every time an alert changes I would 
need to do a write on 1000s of records. This way my write performance every time 
I change an alert would be affected. 

On the other hand, I could have a CF for users-alerts and another for alert 
details. Then, at read time, I would need to query 1000s of alerts for a given 
user.

In both situations, there is a gap between the time data is written and the 
time it's available to be read. 

I understand not normalizing will make me use more disk space, but once the data 
is written, I will be able to perform as many reads as I want with no performance 
penalty. Also, I understand writes are faster than reads in Cassandra, so the gap 
would be smaller in the first solution.

I would be glad in hearing thoughts from the community.

Best regards,
Marcelo Valle.


-- 
Tyler Hobbs
DataStax


-- 
Tyler Hobbs
DataStax




to normalize or not to normalize - read penalty vs write penalty

2015-02-04 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Hello everyone,

I am thinking about the architecture of my application using Cassandra and I am 
asking myself if I should or shouldn't normalize an entity.

I have users and alerts in my application and for each user, several alerts. 
The first model which came into my mind was creating an alerts CF with 
user-id as part of the partition key. This way, I can have fast writes and my 
reads will be fast too, as I will always read per partition.

However, I received a requirement later that made my life more complicated. 
Alerts can be shared by 1000s of users and alerts can change. I am building a 
real time app and if I change an alert, all users related to it should see the 
change. 

Suppose I want to keep things denormalized - every time an alert changes I would 
need to write to 1000s of records. This way my write performance would be 
affected every time I change an alert. 

On the other hand, I could have a CF for users-alerts and another for alert 
details. Then, at read time, I would need to query 1000s of alerts for a given 
user.
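
To make the two options concrete, here is a rough sketch of what the schemas 
could look like, run through the DataStax Python driver (keyspace, table and 
column names are illustrative, not from the original post):

# Sketch of the two candidate models discussed above (names are illustrative).
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')

# Option 1: denormalized - the full alert is copied into every user's partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS alerts_by_user (
        user_id uuid,
        alert_id uuid,
        alert_body text,
        PRIMARY KEY (user_id, alert_id))""")

# Option 2: normalized - one table maps users to alert ids, another holds the
# alert details, which are written exactly once per change.
session.execute("""
    CREATE TABLE IF NOT EXISTS user_alerts (
        user_id uuid,
        alert_id uuid,
        PRIMARY KEY (user_id, alert_id))""")
session.execute("""
    CREATE TABLE IF NOT EXISTS alert_details (
        alert_id uuid PRIMARY KEY,
        alert_body text)""")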

In both situations, there is a gap between the time data is written and the 
time it's available to be read. 

I understand not normalizing will make me use more disk space, but once the data 
is written, I will be able to perform as many reads as I want with no performance 
penalty. Also, I understand writes are faster than reads in Cassandra, so the gap 
would be smaller in the first solution.

I would be glad in hearing thoughts from the community.

Best regards,
Marcelo Valle.

Re: data distribution along column family partitions

2015-02-04 Thread Marcelo Valle (BLOOMBERG/ LONDON)
From: clohfin...@gmail.com 
Subject: Re: data distribution along column family partitions

 not ok :) don't let a single partition get to 1gb, 100's of mb should be when 
 flares are going up.  The main reasoning is that compactions would be 
 horrifically slow and there will be a lot of gc pain. Bringing the time bucket 
 down to a day will probably be sufficient. It would take billions of alarm 
 events in a single time bucket, if that's the entire data payload, to get that bad.
 Wide rows work well, but keeping them smaller is an optimization that will 
 save you a lot of pain down the road from troublesome jvm gcs, slower 
 compactions, unbalanced nodes, and higher read latencies.

That's the point, I won't have many partitions with more than 15gb. But suppose 
I have that for 1000 users, out of 10 million. Almost all partitions will have a 
good size, but won't I have a problem with the few ones which are big? 

I am asking this because, in a prior experience, I felt I was paying a huge 
performance penalty when reading updates from these 1000 users. I might have only 
a few such cases, but assuming that every time data changes I have to process the 
user again, I will hit the worst case very often. 

Chris

On Wed, Feb 4, 2015 at 9:33 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

 The data model lgtm. You may need to balance the size of the time buckets 
 with the amount of alarms to prevent partitions from getting too large. 1
month may be a little large, I would aim to keep the partitions below 25mb (can 
check with nodetool cfstats) or so in size to keep everything happy. It's ok if 
occasional ones go larger, something like 1gb can be bad.. but it would still 
work if not very efficiently.

What about 15 gb?

 Deletes on an entire time-bucket at a time seems like a good approach, but 
 just setting TTL would be far far better imho (why not just set it to two 
 years?). May want to look into new DateTieredCompactionStrategy, or 
 LeveledCompactionStrategy or the obsoleted data will very rarely go away.

Excellent hint, I will take a good look at this. I didn't know about 
DateTieredCompactionStrategy.

 When reading just be sure to use paging (the good cql drivers will have it 
 built in) and don't actually read it all in one massive query. If you 
 decrease size of your time bucket you may end up having to page the query 
 across multiple partitions if Y-X > bucket size.

If I use paging, Cassandra won't try to allocate the whole partition on the 
server node; it will just allocate memory in the heap for that page. Is that 
correct?
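
A minimal paging sketch with the DataStax Python driver, for illustration; the 
table and column names are hypothetical, following the user-id + time-bucket 
layout described in the quoted question below. fetch_size caps how many rows 
are pulled per page, so only one page sits in client memory at a time:

# Sketch: read a large time slice of one partition in pages, not in one shot.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['127.0.0.1']).connect('my_keyspace')

def read_alert_slice(user_id, bucket, start_ts, end_ts):
    query = SimpleStatement(
        "SELECT alert_id, ts FROM alerts_by_user_bucket "
        "WHERE user_id = %s AND time_bucket = %s "
        "AND ts > %s AND ts < %s",
        fetch_size=500)  # rows per page
    # Iterating fetches further pages from the coordinator transparently.
    for row in session.execute(query, (user_id, bucket, start_ts, end_ts)):
        yield row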

Marcelo Valle
From: user@cassandra.apache.org 
Subject: Re: data distribution along column family partitions

The data model lgtm.  You may need to balance the size of the time buckets with 
the amount of alarms to prevent partitions from getting too large.  1 month may 
be a little large, I would aim to keep the partitions below 25mb (can check 
with nodetool cfstats) or so in size to keep everything happy.  It's ok if 
occasional ones go larger, something like 1gb can be bad.. but it would still 
work if not very efficiently.

Deletes on an entire time-bucket at a time seems like a good approach, but just 
setting TTL would be far far better imho (why not just set it to two years?).  
May want to look into new DateTieredCompactionStrategy, or 
LeveledCompactionStrategy or the obsoleted data will very rarely go away.
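
As an illustration of that suggestion, both the two-year expiry and the 
compaction strategy can be set as table properties. A sketch, assuming a 
hypothetical alerts_by_user_bucket table (63072000 seconds is roughly two years):

# Sketch: let writes expire via TTL instead of deleting them manually.
# DateTieredCompactionStrategy groups SSTables by write time, so whole
# expired SSTables can eventually be dropped at once.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')

session.execute("""
    ALTER TABLE alerts_by_user_bucket
    WITH default_time_to_live = 63072000
    AND compaction = {'class': 'DateTieredCompactionStrategy'}""")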

When reading just be sure to use paging (the good cql drivers will have it 
built in) and don't actually read it all in one massive query.  If you decrease 
size of your time bucket you may end up having to page the query across 
multiple partitions if Y-X > bucket size.

Chris

On Wed, Feb 4, 2015 at 4:34 AM, Marcelo Elias Del Valle mvall...@gmail.com 
wrote:

Hello,

I am designing a model to store alerts users receive over time. I will want to 
store probably the last two years of alerts for each user.

The first thought I had was having a column family partitioned by user + 
timebucket, where time bucket could be something like year + month. For 
instance:

partition key:
user-id = f47ac10b-58cc-4372-a567-0e02b2c3d479
time-bucket = 201502
rest of primary key:
timestamp = column of type timestamp
alert id = f47ac10b-58cc-4372-a567-0e02b2c3d480
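
In CQL terms, that model could look roughly like the sketch below (run through 
the DataStax Python driver; the keyspace and table names are made up, the 
columns follow the description above):

# Sketch of the partition-per-user-per-time-bucket model described above:
# partition key = (user_id, time_bucket), clustering columns = (ts, alert_id),
# where time_bucket is year + month (e.g. 201502) and ts is the alert timestamp.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')

session.execute("""
    CREATE TABLE IF NOT EXISTS alerts_by_user_bucket (
        user_id uuid,
        time_bucket int,
        ts timestamp,
        alert_id uuid,
        PRIMARY KEY ((user_id, time_bucket), ts, alert_id))
    WITH CLUSTERING ORDER BY (ts DESC, alert_id ASC)""")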

Question, would this make it easier to delete old data? Supposing I am not 
using TTL and I want to remove alerts older than 2 years, what would be better, 
just deleting the entire time-bucket for each user-id (through a map/reduce 
process) or having just user-id as partition key and deleting, for each user, 
where X < timestamp < Y? 

Is it the same for Cassandra, internally?
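
With that layout, removing one expired month for a user is a single 
whole-partition delete. A sketch, using the hypothetical table above:

# Sketch: deleting the entire (user_id, time_bucket) partition writes one
# partition-level tombstone rather than one tombstone per alert row.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')
delete_bucket = session.prepare(
    "DELETE FROM alerts_by_user_bucket "
    "WHERE user_id = ? AND time_bucket = ?")

def purge_bucket(user_id, bucket):
    session.execute(delete_bucket, (user_id, bucket))

# e.g. purge_bucket(user_id, 201302) for a bucket older than two years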

Another question is: would data be distributed enough if I just choose to 
partition by user-id? I will have some users with a large number of alerts, but 
on average I could consider that alerts would have a good distribution across 
user ids. The problem is I don't feel confident that having a few partitions with 
A LOT of alerts would not be a problem at read time.

Re: data distribution along column family partitions

2015-02-04 Thread Marcelo Valle (BLOOMBERG/ LONDON)
 The data model lgtm. You may need to balance the size of the time buckets 
 with the amount of alarms to prevent partitions from getting too large. 1
month may be a little large, I would aim to keep the partitions below 25mb (can 
check with nodetool cfstats) or so in size to keep everything happy. It's ok if 
occasional ones go larger, something like 1gb can be bad.. but it would still 
work if not very efficiently.

What about 15 gb?

 Deletes on an entire time-bucket at a time seems like a good approach, but 
 just setting TTL would be far far better imho (why not just set it to two 
 years?). May want to look into new DateTieredCompactionStrategy, or 
 LeveledCompactionStrategy or the obsoleted data will very rarely go away.

Excellent hint, I will take a good look at this. I didn't know about 
DateTieredCompactionStrategy.

 When reading just be sure to use paging (the good cql drivers will have it 
 built in) and don't actually read it all in one massive query. If you 
 decrease size of your time bucket you may end up having to page the query 
 across multiple partitions if Y-X > bucket size.

If I use paging, Cassandra won't try to allocate the whole partition on the 
server node; it will just allocate memory in the heap for that page. Is that 
correct?

Marcelo Valle
From: user@cassandra.apache.org 
Subject: Re: data distribution along column family partitions

The data model lgtm.  You may need to balance the size of the time buckets with 
the amount of alarms to prevent partitions from getting too large.  1 month may 
be a little large, I would aim to keep the partitions below 25mb (can check 
with nodetool cfstats) or so in size to keep everything happy.  It's ok if 
occasional ones go larger, something like 1gb can be bad.. but it would still 
work if not very efficiently.

Deletes on an entire time-bucket at a time seems like a good approach, but just 
setting TTL would be far far better imho (why not just set it to two years?).  
May want to look into new DateTieredCompactionStrategy, or 
LeveledCompactionStrategy or the obsoleted data will very rarely go away.

When reading just be sure to use paging (the good cql drivers will have it 
built in) and don't actually read it all in one massive query.  If you decrease 
size of your time bucket you may end up having to page the query across 
multiple partitions if Y-X > bucket size.

Chris

On Wed, Feb 4, 2015 at 4:34 AM, Marcelo Elias Del Valle mvall...@gmail.com 
wrote:

Hello,

I am designing a model to store alerts users receive over time. I will want to 
store probably the last two years of alerts for each user.

The first thought I had was having a column family partitioned by user + 
timebucket, where time bucket could be something like year + month. For 
instance:

partition key:
user-id = f47ac10b-58cc-4372-a567-0e02b2c3d479
time-bucket = 201502
rest of primary key:
timestamp = column of type timestamp
alert id = f47ac10b-58cc-4372-a567-0e02b2c3d480

Question, would this make it easier to delete old data? Supposing I am not 
using TTL and I want to remove alerts older than 2 years, what would be better, 
just deleting the entire time-bucket for each user-id (through a map/reduce 
process) or having just user-id as partition key and deleting, for each user, 
where X < timestamp < Y? 

Is it the same for Cassandra, internally?

Another question is: would data be distributed enough if I just choose to 
partition by user-id? I will have some users with a large number of alerts, but 
on average I could consider that alerts would have a good distribution across 
user ids. The problem is I don't feel confident that having a few partitions with 
A LOT of alerts would not be a problem at read time. 

What happens at read time when I try to read data from a big partition? 
Like, I want to read alerts for a user where X < timestamp < Y, but it would 
return 1 million alerts. As it's all in a single partition, this read will 
occur on a single node, thus allocating a lot of memory for this single 
operation, right? 

What if the memory needed for this operation is bigger than what fits in the 
Java heap? Would this be a problem for Cassandra?


Best regards,
-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr