Re: Spark and intermediate results

2015-10-09 Thread Marcelo Valle (BLOOMBERG/ LONDON)
any doc about it... From: user@cassandra.apache.org Subject: Re: Spark and intermediate results You can run spark against your Cassandra data directly without using a shared filesystem. https://github.com/datastax/spark-cassandra-connector On Fri, Oct 9, 2015 at 6:09 AM Marcelo Valle

Spark and intermediate results

2015-10-09 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Hello, I saw this nice link from an event: http://www.datastax.com/dev/blog/zen-art-spark-maintenance?mkt_tok=3RkMMJWWfF9wsRogvqzIZKXonjHpfsX56%2B8uX6GylMI%2F0ER3fOvrPUfGjI4GTcdmI%2BSLDwEYGJlv6SgFSrXMMblswLgIXBY%3D I would like to test using Spark to perform some operations on a column family,

Re: ScyllaDB, a new open source, Cassandra-compatible NoSQL

2015-09-23 Thread Marcelo Valle (BLOOMBERG/ LONDON)
I think there is a very important point in ScyllaDB: latency. Performance can be an important requirement, but the fact that ScyllaDB is written in C++ and uses lock-free algorithms internally means it should have lower latency than Cassandra, which enables its use for a wider range of applications. It

Re: how many rows can one partion key hold?

2015-02-27 Thread Marcelo Valle (BLOOMBERG/ LONDON)
When one partition's data is extremely large, will writes/reads slow down? This is actually a good question. If a partition has nearly 2 billion rows, will writes or reads get too slow? My understanding is that they shouldn't, as data is indexed inside a partition, and when you read or write you are doing a

Re: Unexplained query slowness

2015-02-26 Thread Marcelo Valle (BLOOMBERG/ LONDON)
, then investigate things like disk latency and noisy neighbours (if you are on VMs / in the cloud). On 26 February 2015 at 03:01, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: I am sorry if it's too basic and you already looked at that, but the first thing I would ask would

Re:Unexplained query slowness

2015-02-25 Thread Marcelo Valle (BLOOMBERG/ LONDON)
I am sorry if it's too basic and you already looked at that, but the first thing I would ask would be the data model. What data model are you using (how is your data partitioned)? What queries are you running? If you are using ALLOW FILTERING, for instance, it will be very easy to say why it's

Re:Cassandra Read Timeout

2015-02-24 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Yulian, maybe other people have other clues, but I think it could help to find the problem if you monitored the behavior in tpstats after the activity "Seeking to partition beginning in data file". Which type of thread is getting stuck? Do you see any number increasing continuously during the

Re: Cassandra Read Timeout

2015-02-24 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Indeed, I thought something odd could be happening to your cluster, but it seems it's working fine; the request is just taking too long to complete. I noticed from your cfstats that the read count was about 10 in the first CF and about 1000 in the second one... Would you be doing much more

Re: Cassandra Read Timeout

2015-02-24 Thread Marcelo Valle (BLOOMBERG/ LONDON)
, 2015 at 6:57 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: Super column? Out of curiosity, which Cassandra version are you running? From: user@cassandra.apache.org Subject: Re: Cassandra Read Timeout Hello The structure is the same , the CFs are super column CFs

Re: Cassandra Read Timeout

2015-02-24 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Super column? Out of curiosity, which Cassandra version are you running? From: user@cassandra.apache.org Subject: Re: Cassandra Read Timeout Hello The structure is the same, the CFs are super column CFs, where the key is a long (a timestamp to partition the index, so every 11 days a new row is

Re:designing table

2015-02-20 Thread Marcelo Valle (BLOOMBERG/ LONDON)
My two cents: you could partition your data per date, and the second query would be easy. If you need to query ALL data for a client id it would be hard, but querying the last 10 days for a client id could be easy, for instance. If you need to query ALL, it would probably be better to create another
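The per-date partitioning idea above can be sketched in Python. This is only an illustration: the table name `events_by_client_day` and the bucket granularity are hypothetical, and the point is just that a "last 10 days for one client" read fans out into one query per day bucket:

```python
from datetime import date, timedelta

def last_n_day_buckets(client_id, n=10, today=None):
    """Build the (client_id, day) partition keys needed to fetch
    the last n days of data for one client, newest first."""
    today = today or date.today()
    return [(client_id, (today - timedelta(days=i)).isoformat())
            for i in range(n)]

buckets = last_n_day_buckets("client-42", n=3, today=date(2015, 2, 20))
# one query per day bucket, e.g. (hypothetical table):
# SELECT * FROM events_by_client_day WHERE client_id = ? AND day = ?
```

Querying ALL history for a client then means touching every day bucket, which is why a second, differently partitioned table is usually the better answer for that access pattern.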

Re:PySpark and Cassandra integration

2015-02-20 Thread Marcelo Valle (BLOOMBERG/ LONDON)
I will try it for sure Frens, very nice! Thanks for sharing! From: user@cassandra.apache.org Subject: Re:PySpark and Cassandra integration Hi all, Wanted to let you know I've forked PySpark Cassandra on https://github.com/TargetHolding/pyspark-cassandra. Unfortunately the original code

Re:query by column size

2015-02-13 Thread Marcelo Valle (BLOOMBERG/ LONDON)
There is no automatic indexing in Cassandra. There are secondary indexes, but not for these cases. You could use a solution like DSE to get data automatically indexed in Solr, on each node, as soon as data arrives. Then you could run such a query against Solr. If the query can be slow, you could run a

Re: best supported spark connector for Cassandra

2015-02-13 Thread Marcelo Valle (BLOOMBERG/ LONDON)
: Re: best supported spark connector for Cassandra Of course, Stratio Deep and Stratio Cassandra are licensed Apache 2.0. Regarding the Cassandra support, I can introduce you to someone in Stratio that can help you. 2015-02-12 15:05 GMT+01:00 Marcelo Valle (BLOOMBERG/ LONDON) mvallemil

Re: best supported spark connector for Cassandra

2015-02-13 Thread Marcelo Valle (BLOOMBERG/ LONDON)
, Regards, Carlos Juzarte Rolo Cassandra Consultant Pythian - Love your data rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo Tel: 1649 www.pythian.com On Fri, Feb 13, 2015 at 3:05 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: Actually, I

Re: best supported spark connector for Cassandra

2015-02-12 Thread Marcelo Valle (BLOOMBERG/ LONDON)
, /Shahab On Wed, Feb 11, 2015 at 2:51 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: I just finished a scala course, nice exercise to check what I learned :D Thanks for the answer! From: user@cassandra.apache.org Subject: Re: best supported spark connector for Cassandra

Re: How to speed up SELECT * query in Cassandra

2015-02-12 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Thanks Jirka! From: user@cassandra.apache.org Subject: Re: How to speed up SELECT * query in Cassandra Hi, here are some snippets of code in Scala which should get you started. Jirka H. loop { lastRow = val query = lastRow
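The Scala snippet above is truncated in this archive, but the loop it describes — fetch a page, remember the last row, and issue the next query from that row — can be sketched in Python with a stub standing in for the real driver call (the token() predicate in the comment is the usual Cassandra paging idiom; the stub and data are purely illustrative):

```python
def scan_all(fetch_page, page_size=2):
    """Repeatedly fetch pages, using the last row of each page as the
    cursor for the next query, until a short page signals the end."""
    rows, last = [], None
    while True:
        page = fetch_page(last, page_size)
        rows.extend(page)
        if len(page) < page_size:   # short page: nothing left to read
            return rows
        last = page[-1]             # resume after this row next time

# Stand-in for a real driver call such as:
#   SELECT * FROM t WHERE token(pk) > token(?) LIMIT ?
data = list(range(7))
def fetch_page(after, limit):
    start = 0 if after is None else data.index(after) + 1
    return data[start:start + limit]

print(scan_all(fetch_page))  # all 7 rows, fetched in 4 round trips
```

With a real driver, `fetch_page` would execute the token-range query shown in the comment and `last` would be the last partition key seen.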

Re: best supported spark connector for Cassandra

2015-02-11 Thread Marcelo Valle (BLOOMBERG/ LONDON)
-L336 Start digging from this all the way down the code. As for Stratio Deep, I can't tell how the did the integration with Spark. Take some time to dig down their code to understand the logic. On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote

Re:How to speed up SELECT * query in Cassandra

2015-02-11 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Look for the message Re: Fastest way to map/parallel read all values in a table? in the mailing list; it was recently discussed. You can have several parallel processes, each one reading a slice of the data, by splitting min/max murmur3 hash ranges. At the company I used to work for, we developed a
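Splitting the min/max Murmur3 hash ranges mentioned above can be sketched as follows. The ring bounds are the standard Murmur3Partitioner limits (-2^63 to 2^63 - 1); the worker query in the comment is illustrative:

```python
def token_slices(n):
    """Split the full Murmur3 token ring (-2**63 .. 2**63 - 1)
    into n contiguous (start, end] slices for parallel readers."""
    lo, hi = -2**63, 2**63 - 1
    step = (hi - lo) // n
    bounds = [lo + i * step for i in range(n)] + [hi]
    return list(zip(bounds[:-1], bounds[1:]))

slices = token_slices(4)
# each worker then runs something like:
#   SELECT * FROM t WHERE token(pk) > ? AND token(pk) <= ?
```

Each process takes one slice, so the full table is covered exactly once with no coordination between readers.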

best supported spark connector for Cassandra

2015-02-11 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Taking the opportunity that Spark was being discussed in another thread, I decided to start a new one, as I have an interest in using Spark + Cassandra in the future. About 3 years ago, Spark was not an existing option and we tried to use Hadoop to process Cassandra data. My experience was horrible and

Re:Fastest way to map/parallel read all values in a table?

2015-02-09 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Just for the record, I was doing the exact same thing in an internal application at the start-up where I used to work. We had the need to write custom code to process all rows of a column family in parallel. Normally we would use Spark for the job, but in our case the logic was a little more

Re:Adding more nodes causes performance problem

2015-02-09 Thread Marcelo Valle (BLOOMBERG/ LONDON)
AFAIK, if you were using RF 3 in a 3-node cluster, then all your nodes held all your data. When the number of nodes started to grow, this assumption stopped being true. I think Cassandra will scale linearly from 9 nodes on, but comparing it with a situation where all your nodes hold all your data is not
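A rough back-of-envelope for the point above (ignoring token imbalance): with replication factor rf in an n-node cluster, each node holds about min(rf, n)/n of the data set, so RF 3 on 3 nodes means every node has everything, while RF 3 on 9 nodes drops that to a third:

```python
def data_fraction_per_node(rf, nodes):
    """Approximate share of the total data set held by each node:
    rf copies spread across n nodes, capped at 100%."""
    return min(rf, nodes) / nodes

assert data_fraction_per_node(3, 3) == 1.0  # every node holds all the data
assert abs(data_fraction_per_node(3, 9) - 1 / 3) < 1e-9  # a third each
```

This is why read-performance comparisons between a 3-node RF 3 cluster and a larger cluster are apples to oranges: in the small cluster every node can serve every request locally.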

Re: to normalize or not to normalize - read penalty vs write penalty

2015-02-04 Thread Marcelo Valle (BLOOMBERG/ LONDON)
to update alerts? How often do you expect to read the alerts? I suspect you'll be doing 100x more reads (or more), in which case optimizing for reads is definitely the right choice. On Wed, Feb 4, 2015 at 9:50 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: Hello

Re: to normalize or not to normalize - read penalty vs write penalty

2015-02-04 Thread Marcelo Valle (BLOOMBERG/ LONDON)
the changes for 20 to 50ms, unless they know to read the details for that exact alert. On Wed, Feb 4, 2015 at 11:57 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: I don't want to optimize for reads or writes, I want to optimize for having the smallest gap possible between

to normalize or not to normalize - read penalty vs write penalty

2015-02-04 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Hello everyone, I am thinking about the architecture of my application using Cassandra and asking myself whether or not I should normalize an entity. I have users and alerts in my application and, for each user, several alerts. The first model that came to mind was creating an

Re: data distribution along column family partitions

2015-02-04 Thread Marcelo Valle (BLOOMBERG/ LONDON)
. Chris On Wed, Feb 4, 2015 at 9:33 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: The data model lgtm. You may need to balance the size of the time buckets with the amount of alarms to prevent partitions from getting too large. 1 month may be a little large, I would

Re: data distribution along column family partitions

2015-02-04 Thread Marcelo Valle (BLOOMBERG/ LONDON)
The data model LGTM. You may need to balance the size of the time buckets against the number of alarms to prevent partitions from getting too large. One month may be a little large; I would aim to keep the partitions below 25 MB (you can check with nodetool cfstats) or so in size to keep everything
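The 25 MB guideline above turns into a quick sizing estimate. This Python sketch (the alarm rate and row size are made-up example numbers, not from the thread) picks the largest whole number of days per time bucket that keeps a partition under the target:

```python
def bucket_days(rows_per_day, bytes_per_row, target_mb=25):
    """Largest whole number of days per time bucket that keeps the
    partition under target_mb. A rough estimate only; verify the real
    partition sizes with nodetool cfstats."""
    bytes_per_day = rows_per_day * bytes_per_row
    return max(1, (target_mb * 1024 * 1024) // bytes_per_day)

# e.g. 50k alarms/day at ~200 bytes each is ~10 MB/day -> 2-day buckets
days = bucket_days(50000, 200)
```

If the estimate comes out below one day, that is the signal to add another component (e.g. an hour field) to the partition key rather than to accept oversized partitions.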