Re: 1000's of column families
Ben, to address your question, read my last post, but to summarize: yes, there is less memory overhead in prefixing keys than in managing multiple CFs, EXCEPT when doing map/reduce. Doing map/reduce, you will now have HUGE overhead reading a whole slew of rows you don't care about, as you can't map/reduce a single virtual CF but must map/reduce the whole CF, wasting TONS of resources.

Thanks,
Dean

On 10/1/12 3:38 PM, Ben Hood <0x6e6...@gmail.com> wrote:
On Mon, Oct 1, 2012 at 9:38 PM, Brian O'Neill <b...@alumni.brown.edu> wrote:
It's just a convenient way of prefixing:
http://hector-client.github.com/hector/build/html/content/virtual_keyspaces.html
So given that it is possible to use a CF per tenant, should we assume that at sufficient scale there is less overhead to prefix keys than there is to manage multiple CFs?
Ben
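For readers following the thread: the key prefixing being discussed just means composing the physical row key from a tenant prefix plus the logical key, which is conceptually what Hector's virtual keyspaces do. A minimal sketch (the separator and helper names are illustrative, not Hector's actual API):

```java
// Sketch of virtual-CF key prefixing: many tenants share one physical CF
// by composing row keys as "<tenant>:<key>". Names here are hypothetical.
public class VirtualCfKeys {
    static final char SEP = ':';

    // Build the physical row key for a tenant's logical key.
    static String toPhysical(String tenant, String logicalKey) {
        return tenant + SEP + logicalKey;
    }

    // Recover {tenant, logicalKey} from a physical row key.
    static String[] fromPhysical(String physicalKey) {
        int i = physicalKey.indexOf(SEP);
        return new String[] { physicalKey.substring(0, i), physicalKey.substring(i + 1) };
    }

    public static void main(String[] args) {
        String k = toPhysical("tenantA", "sensor-42");
        System.out.println(k); // tenantA:sensor-42
        String[] parts = fromPhysical(k);
        System.out.println(parts[0] + " / " + parts[1]); // tenantA / sensor-42
    }
}
```

The memory point in the thread follows from this: the prefix costs a few bytes per row, whereas each extra CF carries its own memtable and metadata on every node.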
Re: 1000's of column families
Without putting too much thought into it... given the underlying architecture, I think you could/would have to write your own partitioner, which would partition based on the prefix/virtual keyspace.

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive, King of Prussia, PA 19406
M: 215.588.6024
@boneill42 http://www.twitter.com/boneill42
healthmarketscience.com

The information transmitted in this email message is for the intended recipient only and may contain confidential and/or privileged material. If you received this email in error and are not the intended recipient, or the person responsible for delivering it to the intended recipient, please contact the sender at the email above, delete this email and any attachments, and destroy any copies thereof. Any review, retransmission, dissemination, copying or other use of, or taking any action in reliance upon, this information by persons or entities other than the intended recipient is strictly prohibited.

On 10/2/12 9:00 AM, Ben Hood <0x6e6...@gmail.com> wrote:
Dean,
On Tue, Oct 2, 2012 at 1:37 PM, Hiller, Dean <dean.hil...@nrel.gov> wrote:
Ben, to address your question, read my last post but to summarize: yes, there is less memory overhead in prefixing keys than in managing multiple CFs, EXCEPT when doing map/reduce, where you can't map/reduce a single virtual CF but must map/reduce the whole CF, wasting TONS of resources.
That's a good point that I hadn't considered beforehand, especially as I'd like to run MR jobs against these CFs. Is this limitation inherent in the way that Cassandra is modelled as input for Hadoop, or could you write a custom slice query to feed only one particular prefix into Hadoop?
Cheers,
Ben
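The prefix-based partitioner Brian suggests can be sketched in toy form: derive the token from only the tenant prefix of the key. This is an illustration of the idea, not Cassandra's `IPartitioner` API, and CRC32 stands in for the real token hash. Note the consequence discussed later in the thread: every row of a virtual CF maps to the same token, hence the same replicas.

```java
// Toy prefix partitioner: token = hash(tenant prefix only), so an entire
// virtual CF lands on one replica set. Not Cassandra's IPartitioner API.
import java.util.zip.CRC32;
import java.nio.charset.StandardCharsets;

public class PrefixPartitioner {
    // Derive a token from only the tenant prefix of "tenant:key".
    static long token(String rowKey) {
        String prefix = rowKey.substring(0, rowKey.indexOf(':'));
        CRC32 crc = new CRC32();
        crc.update(prefix.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public static void main(String[] args) {
        // All of tenantA's rows get the same token -> same node(s).
        System.out.println(token("tenantA:row1") == token("tenantA:row2")); // true
    }
}
```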
Re: 1000's of column families
Thanks for the idea (but please keep thinking on it)... that is 100% what we don't want, since the partitioned data then resides on the same node. I want to map/reduce the column families and leverage the parallel disks :( :( I am sure others would want to do the same. We almost need a feature of virtual column families, and "column family" should really not be called column family but ReplicationGroup or something, where replication is configured for all CFs in that group.

ANYONE have any other ideas???

Dean

On 10/2/12 7:20 AM, Brian O'Neill <boneil...@gmail.com> wrote:
Without putting too much thought into it... Given the underlying architecture, I think you could/would have to write your own partitioner, which would partition based on the prefix/virtual keyspace.
-brian
Re: 1000's of column families
Agreed. Do we know yet what the overhead is for each column family? What is the limit? If you have a SINGLE keyspace w/ 2+ CFs, what happens? Anyone know?

-brian

On 10/2/12 9:28 AM, Hiller, Dean <dean.hil...@nrel.gov> wrote:
Thanks for the idea (but please keep thinking on it)... that is 100% what we don't want, since the partitioned data then resides on the same node. I want to map/reduce the column families and leverage the parallel disks. We almost need a feature of virtual column families, where replication is configured for all CFs in that group. ANYONE have any other ideas???
Dean
Re: 1000's of column families
Brian,

On Tue, Oct 2, 2012 at 2:20 PM, Brian O'Neill <boneil...@gmail.com> wrote:
Without putting too much thought into it... Given the underlying architecture, I think you could/would have to write your own partitioner, which would partition based on the prefix/virtual keyspace.

I might be barking up the wrong tree here, but looking at the source of ColumnFamilyInputFormat, it seems that you can specify a KeyRange for the input, but only when you use an order-preserving partitioner. So I presume that if you are using the RandomPartitioner, you are effectively doing a full CF scan (i.e. including all tenants in your system).

Ben
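Ben's observation can be illustrated: under a randomizing partitioner, rows are placed on the ring in token (hash) order, which bears no relation to key order, so a contiguous key range like "all of tenantA's keys" has no contiguous span on the ring to hand a KeyRange. A toy sketch (CRC32 stands in for RandomPartitioner's MD5 token):

```java
// Why KeyRange scans need an order-preserving partitioner: sorting by the
// token hash scrambles lexical key order, scattering a tenant's rows.
import java.util.*;
import java.util.zip.CRC32;

public class HashOrder {
    // Stand-in for the partitioner's token function (RandomPartitioner uses MD5).
    static long token(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(java.nio.charset.StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // The keys in ring order, i.e. sorted by token rather than lexically.
    static List<String> ringOrder(List<String> keys) {
        List<String> r = new ArrayList<>(keys);
        r.sort(Comparator.comparingLong(HashOrder::token));
        return r;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
            "tenantA:1", "tenantA:2", "tenantA:3", "tenantB:1", "tenantB:2");
        // Same keys, but ordered by hash; tenantA's rows are in general
        // interleaved with everyone else's, so no KeyRange covers just them.
        System.out.println(ringOrder(keys));
    }
}
```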
Re: 1000's of column families
Exactly.

-brian

On 10/2/12 9:55 AM, Ben Hood <0x6e6...@gmail.com> wrote:
Looking at the source of ColumnFamilyInputFormat, it seems that you can specify a KeyRange for the input, but only when you use an order-preserving partitioner. So I presume that if you are using the RandomPartitioner, you are effectively doing a full CF scan (i.e. including all tenants in your system).
Ben
Re: 1000's of column families
On Tue, Oct 2, 2012 at 3:37 PM, Brian O'Neill <boneil...@gmail.com> wrote:
Exactly.

So you're back to the deliberation between using multiple CFs (potentially with some known working upper bound*) or feeding your map/reduce in some other way (as you decided to do with Storm). In my particular scenario I'd like to be able to do a combination of batch processing on top of less frequently changing data (which is why I was looking at Hadoop) and some real-time analytics.

Cheers,
Ben

(*) Not sure whether this applies to an individual keyspace or an entire cluster.
Re: 1000's of column families
Ben, Brian,

By the way, PlayOrm offers a NoSqlTypedSession that is different from the ORM half of PlayOrm: it deals in raw data and does indexing, so you can run scalable SQL on data that has no ORM on top of it. That is what we use for our 1000's of CFs, as we don't know the format of any of those tables ahead of time (in our world, users tell us the format and wire in streams through an API we expose, AND they tell PlayOrm which columns to index). That layer deals with BigInteger, BigDecimal, String, and I think byte[].

So I am going to add virtual CFs to PlayOrm in the coming week. We are going to feed in streams, partition the virtual CFs (which sit in a single real CF) using PlayOrm partitioning, and then query into each partition. The only real issue is knowing which partitions exist, and that is left to the client to keep track of; but if your app knows all the partitions (and those could be saved to some rows in the NoSQL store), that works. I will probably try out Storm after that.

Later,
Dean

On 10/2/12 9:09 AM, Ben Hood <0x6e6...@gmail.com> wrote:
So you're back to the deliberation between using multiple CFs (potentially with some known working upper bound) or feeding your map/reduce in some other way (as you decided to do with Storm)…
Re: 1000's of column families
Another option that may or may not work for you is the support in Cassandra 1.1+ for using a secondary index as an input to your map/reduce job. What you might do is add a field to the column family that represents which virtual column family the row is part of. Then, when doing map/reduce jobs, you could use that field as the secondary index limiter. Secondary index map/reduce is not as efficient, since you first get all of the keys and then do multigets to fetch the data that you need for the job. However, it's another option for not scanning the whole column family.

On Oct 2, 2012, at 10:09 AM, Ben Hood <0x6e6...@gmail.com> wrote:
So you're back to the deliberation between using multiple CFs (potentially with some known working upper bound) or feeding your map/reduce in some other way (as you decided to do with Storm)…
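Jeremy's two-step pattern (index lookup for the matching keys, then multigets for the rows) can be sketched with in-memory maps standing in for the physical CF; the "vcf" marker column and all names here are illustrative, and the real index scan happens inside Cassandra:

```java
// Sketch of secondary-index-driven access: step 1 finds row keys whose
// "vcf" column matches the virtual CF; step 2 multigets just those rows.
import java.util.*;

public class IndexThenMultiget {
    // Toy physical CF: row key -> columns; "vcf" marks the virtual CF.
    static Map<String, Map<String, String>> sampleCf() {
        Map<String, Map<String, String>> cf = new HashMap<>();
        cf.put("r1", Map.of("vcf", "tenantA", "temp", "21.5"));
        cf.put("r2", Map.of("vcf", "tenantB", "co2", "410"));
        cf.put("r3", Map.of("vcf", "tenantA", "temp", "19.0"));
        return cf;
    }

    // Step 1: the index lookup -- all row keys whose "vcf" column matches.
    static Set<String> matchingKeys(Map<String, Map<String, String>> cf, String vcf) {
        Set<String> keys = new HashSet<>();
        for (Map.Entry<String, Map<String, String>> e : cf.entrySet())
            if (vcf.equals(e.getValue().get("vcf"))) keys.add(e.getKey());
        return keys;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> cf = sampleCf();
        Set<String> keys = matchingKeys(cf, "tenantA"); // index scan
        // Step 2: multiget only those rows.
        Map<String, Map<String, String>> rows = new HashMap<>();
        for (String k : keys) rows.put(k, cf.get(k));
        System.out.println(rows.keySet()); // just tenantA's rows
    }
}
```

The inefficiency Jeremy mentions is visible in the shape of the code: the row data is fetched key by key rather than as one contiguous scan.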
Re: 1000's of column families
Jeremy,

On Tuesday, October 2, 2012 at 17:06, Jeremy Hanna wrote:
Secondary index mapreduce is not as efficient since you first get all of the keys and then do multigets to get the data that you need for the mapreduce job. However, it's another option for not scanning the whole column family.

Interesting. This is probably a stupid question, but why shouldn't you be able to use the secondary index to go straight to the slices that belong to the attribute you are searching by? Is this something to do with the way Cassandra is exposed as an InputFormat for Hadoop, or is this a general property of searching by secondary index?

Ben
Re: 1000's of column families
Because the data for an index is not all together (i.e. you need a multiget to get the data); it is not contiguous. With a prefix, the data is kept together in a partition, so all data for a prefix, from what I understand, is contiguous.

QUESTION: What I don't get in the comment is this: I assume you are referring to CQL, in which case we would need to specify the partition (in addition to the index), which means all that data is on one node, correct? Or did I miss something there?

Thanks,
Dean

From: Ben Hood <0x6e6...@gmail.com>
Date: Tuesday, October 2, 2012 11:18 AM
To: user@cassandra.apache.org
Subject: Re: 1000's of column families

Interesting. This is probably a stupid question, but why shouldn't you be able to use the secondary index to go straight to the slices that belong to the attribute you are searching by? Is this something to do with the way Cassandra is exposed as an InputFormat for Hadoop, or is this a general property of searching by secondary index?

Ben
Re: 1000's of column families
Dean,

On Tuesday, October 2, 2012 at 18:52, Hiller, Dean wrote:
Because the data for an index is not all together (i.e. you need a multiget to get the data). It is not contiguous. With a prefix, the data is kept together in a partition, so all data for a prefix is contiguous.

So you're saying that you can access the primary index with a key range, but to access the secondary index you first need to get all the keys and follow up with a multiget, which would use the secondary index to speed up the lookup of the matching rows?

QUESTION: What I don't get in the comment is I assume you are referring to CQL, in which case we would need to specify the partition (in addition to the index), which means all that data is on one node, correct? Or did I miss something there?

Maybe my question was just silly - I wasn't referring to CQL. As for the locality of the data, I was hoping to be able to fire off an MR job to process all matching rows in the CF - I was assuming that this job would get executed on the same node as the data. But I think the real confusion in my question has to do with the way ColumnFamilyInputFormat has been implemented, since it would appear that it ingests the entire (non-OPP) CF into Hadoop, such that the predicate needs to be applied in the job rather than up front in the Cassandra query.

Cheers,
Ben
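If the whole (non-OPP) CF really is fed into Hadoop, as Ben suspects, then the virtual-CF predicate ends up as a per-row filter inside the map function. A toy sketch of that filtering step (Hadoop's Mapper types are omitted; all names are illustrative):

```java
// Mapper-side tenant filter: every row of the physical CF is read, and
// rows belonging to other tenants are simply dropped -- the waste the
// thread is discussing.
import java.util.ArrayList;
import java.util.List;

public class TenantFilterMapper {
    // Does this physical row belong to the virtual CF we care about?
    static boolean belongsTo(String rowKey, String tenant) {
        return rowKey.startsWith(tenant + ":");
    }

    // Stand-in for map(key, columns, context): filter, then emit.
    static void map(String rowKey, String tenant, List<String> emitted) {
        if (!belongsTo(rowKey, tenant)) return; // read, then thrown away
        emitted.add(rowKey); // ...real output would be emitted here
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        for (String k : new String[] {"tenantA:r1", "tenantB:r1", "tenantA:r2"})
            map(k, "tenantA", out);
        System.out.println(out); // [tenantA:r1, tenantA:r2]
    }
}
```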
Re: 1000's of column families
On Tuesday, October 2, 2012, Ben Hood <0x6e6...@gmail.com> wrote:
So you're saying that you can access the primary index with a key range, but to access the secondary index, you first need to get all keys and follow up with a multiget, which would use the secondary index to speed the lookup of the matching rows?

Yes, that is how I believe it works. I am by no means an expert.

I also wanted to fire off an MR job to process matching rows in the virtual CF, ideally running on the nodes where it reads the data in. In 0.7, I thought the M/R jobs did not run locally with the data like Hadoop does??? Anyone know if that is still true, or does it run locally to the data now?

Thanks,
Dean
Re: 1000's of column families
It's always had data locality (since Hadoop support was added in 0.6). You don't need to specify a partition; you specify the input predicate with ConfigHelper or the cassandra.input.predicate property.

On Oct 2, 2012, at 2:26 PM, Hiller, Dean <dean.hil...@nrel.gov> wrote:
In 0.7, I thought the M/R jobs did not run locally with the data like Hadoop does??? Anyone know if that is still true, or does it run locally to the data now?
Thanks,
Dean
Re: 1000's of column families
Dean,

We have the same question... We have thousands of separate feeds of data as well (20,000+). To date, we've been using a CF-per-feed strategy, but as we scale this thing out to accommodate all of those feeds, we're trying to figure out if we're going to blow out the memory.

The initial documentation for heap sizing had column families in the equation:
http://www.datastax.com/docs/0.7/operations/tuning#heap-sizing
But in the more recent documentation, it looks like they removed the column family variable with the introduction of the universal key_cache_size:
http://www.datastax.com/docs/1.0/operations/tuning#tuning-java-heap-size

We haven't committed either way yet, but given Ed Anuff's presentation on virtual keyspaces, we were leaning towards a single column family approach:
http://blog.apigee.com/detail/building_a_mobile_data_platform_with_cassandra_-_apigee_under_the_hood/

Definitely let us know what you decide.

-brian

On Fri, Sep 28, 2012 at 11:48 AM, Flavio Baronti <f.baro...@list-group.com> wrote:
We had some serious trouble with dynamically adding CFs, although last time we tried we were using version 0.7, so maybe that's not an issue any more. Our problems were two:
- You are (were?) not supposed to add CFs concurrently. Since we had more servers talking to the same Cassandra cluster, we had to use distributed locks (Hazelcast) to avoid concurrency.
- You must be very careful when adding new CFs through different Cassandra nodes. If you do that fast enough, and the clocks of the two servers are skewed, you will severely compromise your schema (Cassandra will not understand in which order the updates must be applied).
As I said, this applied to version 0.7; maybe current versions have solved these problems.
Flavio

On 2012/09/27 16:11, Hiller, Dean wrote:
We have 1000's of different building devices and we stream data from these devices. The format and data from each one varies, so one device has temperature at timeX with some other variables, and another device has CO2 percentage and other variables. Every device is unique and streams its own data. We dynamically discover devices and register them. Basically, one CF or table per thing really makes sense in this environment. While we could try to find out which devices are similar, this would really be a pain, and some devices add some new variable into the equation. NOT only that, but researchers can register new datasets and upload them as well, and each dataset they have they do NOT necessarily want to share with other researchers, so we have security groups and each CF belongs to security groups. We dynamically create CFs on the fly as people register new datasets.

On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions. We could create one CF, put in all the data, and have a partition per device, but then a time partition would contain multiple devices' data, meaning we would need to shrink our time partition size; whereas if we have a CF per device, the time partition can be larger, as it is only for that one device.

THEN, on top of that, we have a meta CF for these devices, so some people want to query for streams that match criteria AND get back a CF name, and then they query that CF name. So we almost need a query with variables, like "select cfName from Meta where x = y" and then "select * from cfName where x". Which we can do today.

Dean

From: Marcelo Elias Del Valle <mvall...@gmail.com>
Date: Thursday, September 27, 2012 8:01 AM
To: user@cassandra.apache.org
Subject: Re: 1000's of column families

Out of curiosity, is it really necessary to have that amount of CFs? I am probably still used to relational databases, where you would use a new table just in case you need to store different kinds of data. As Cassandra stores anything in each CF, it might make sense to have a lot of CFs to store your data... But why wouldn't you use a single CF with partitions in this case? Wouldn't it be the same thing? I am asking because I might learn a new modeling technique with the answer. []s

2012/9/26 Hiller, Dean <dean.hil...@nrel.gov>
We are streaming data with 1 stream per 1 CF and we have 1000's of CFs. When using the tools, they are all geared to analyzing ONE column family at a time :(. If I remember correctly, Cassandra supports as many CFs as you want, correct? Even though I am going to have tons of fun with limitations on the tools, correct? (I may end up wrapping nodetool with my own aggregate calls if needed to sum up multiple column families and such.)
Thanks,
Dean

--
Marcelo Elias Del Valle
http://mvalle.com
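The meta-CF indirection Dean describes ("select cfName from Meta where x = y", then "select * from cfName") is just a two-step lookup. A toy sketch with maps standing in for the two CFs; every name here is hypothetical:

```java
// Two-step meta lookup: the Meta "CF" maps a criterion to the name of the
// data CF, which is then queried by that name.
import java.util.List;
import java.util.Map;

public class MetaLookup {
    // Meta CF: criterion -> name of the CF holding the matching stream.
    static final Map<String, String> META = Map.of("site=denver", "device_17");
    // One "CF" per device, keyed by CF name (toy stand-in for the store).
    static final Map<String, List<String>> DATA =
        Map.of("device_17", List.of("t0:21.5", "t1:21.7"));

    // "select cfName from Meta where <criterion>" then "select * from <cfName>".
    static List<String> query(String criterion) {
        String cfName = META.get(criterion);
        if (cfName == null) return List.of(); // no stream matches the criterion
        return DATA.getOrDefault(cfName, List.of());
    }

    public static void main(String[] args) {
        System.out.println(query("site=denver")); // [t0:21.5, t1:21.7]
    }
}
```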
Re: 1000's of column families
Well, I am now thinking of adding a virtual capability to PlayOrm, which we currently use, to allow grouping entities into one column family. Right now the CF creation comes from a single entity, so this may change for those entities that declare they are in a single CF group… This should not be a very hard change if we decide to do it. It makes us rely even more on PlayOrm's command line tool (instead of cassandra-cli), as I can't stand reading hex all the time, nor do I like switching my assume validator to utf8, to decimal, to integer just so I can read stuff.

Later,
Dean

On 10/1/12 9:22 AM, Brian O'Neill <b...@alumni.brown.edu> wrote:
Dean,
We have the same question... We have thousands of separate feeds of data as well (20,000+). To date, we've been using a CF-per-feed strategy, but as we scale this thing out to accommodate all of those feeds, we're trying to figure out if we're going to blow out the memory…
Definitely let us know what you decide.
-brian
Re: 1000's of column families
Brian, On Mon, Oct 1, 2012 at 4:22 PM, Brian O'Neill b...@alumni.brown.edu wrote: We haven't committed either way yet, but given Ed Anuff's presentation on virtual keyspaces, we were leaning towards a single column family approach: http://blog.apigee.com/detail/building_a_mobile_data_platform_with_cassandra_-_apigee_under_the_hood/? Is this doing something special or is this just a convenient way of prefixing keys to make the storage space multi-tenanted? Cheers, Ben
Re: 1000's of column families
It's just a convenient way of prefixing: http://hector-client.github.com/hector/build/html/content/virtual_keyspaces.html -brian -- Brian O'Neill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile: 215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
Re: 1000's of column families
On Mon, Oct 1, 2012 at 9:38 PM, Brian O'Neill b...@alumni.brown.edu wrote: It's just a convenient way of prefixing: http://hector-client.github.com/hector/build/html/content/virtual_keyspaces.html So given that it is possible to use a CF per tenant, should we assume that, at sufficient scale, there is less overhead in prefixing keys than in managing multiple CFs? Ben
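For anyone skimming the thread: the "virtual keyspace" trick Brian links to boils down to key manipulation. A minimal sketch of the idea in Python (illustrative only; Hector does this transparently in Java, and the separator character here is an assumption):

```python
def to_virtual_key(tenant: str, row_key: str, sep: str = ":") -> str:
    """Prefix a row key with its tenant to simulate a per-tenant CF."""
    if sep in tenant:
        raise ValueError("tenant name must not contain the separator")
    return tenant + sep + row_key

def from_virtual_key(virtual_key: str, sep: str = ":") -> tuple:
    """Split a prefixed key back into (tenant, original row key)."""
    tenant, _, row_key = virtual_key.partition(sep)
    return tenant, row_key
```

Note that with the RandomPartitioner the whole prefixed key gets hashed, so one tenant's rows are scattered across the ring like everyone else's; that is exactly why a map/reduce job cannot read just one "virtual CF" without scanning the physical CF.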
Re: 1000's of column families
I thought someone was saying each column family added to RAM on every node, not RAM on a single node. It adds RAM on every node??? So eventually, I will run out? Was that person wrong? This would mean adding nodes does not help if he is right. Can anyone confirm this? Thanks, Dean From: Robin Verlangen ro...@us2.nl Reply-To: user@cassandra.apache.org Date: Thursday, September 27, 2012 11:52 PM To: user@cassandra.apache.org Subject: Re: 1000's of column families so if you add up all the applications which would be huge and then all the tables which is large, it just keeps growing. It is a very nice concept (all data in one location), though we will see how implementing it goes. This shouldn't be a real problem for Cassandra. Just add more nodes and every node contains a smaller piece of the cake (~ring). Best regards, Robin Verlangen Software engineer W http://www.robinverlangen.nl E ro...@us2.nl http://goo.gl/Lt7BC Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies. 2012/9/27 Hiller, Dean dean.hil...@nrel.gov Unfortunately, the security aspect is very strict. Some make their data public but there are many projects where, due to client contracts, they cannot make their data public within our company (i.e.
Other groups in our company are not allowed to see the data). Also, currently, we have researchers upload their own datasets as well. Ideally, they want to see this one noSQL store as the place where all data for the company lives... ALL of it, so if you add up all the applications which would be huge and then all the tables which is large, it just keeps growing. It is a very nice concept (all data in one location), though we will see how implementing it goes. How much overhead per column family in RAM? So far we have around 4000 CFs with no issue that I see yet. Dean On 9/27/12 11:10 AM, Aaron Turner synfina...@gmail.com wrote: On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean dean.hil...@nrel.gov wrote: We have 1000's of different building devices and we stream data from these devices. [...] How strict are your security requirements? If it wasn't for that, you'd be much better off storing data on a per-statistic basis than per-device. Hell, you could store everything in a single CF by using a composite row key: devicename|stat type|instance But yeah, there isn't a hard limit for the number of CFs, but there is overhead associated with each one, and so I wouldn't consider your design scalable. Generally speaking, hundreds are OK, but thousands is pushing it. -- Aaron Turner http://synfin.net/ Twitter: @synfinatic http://tcpreplay.synfin.net/ - Pcap editing and replay tools
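Aaron's composite-row-key suggestion, sketched in Python (illustrative only; a real deployment would more likely use Cassandra's CompositeType or a byte-safe encoding rather than this naive string join, and the separator is whatever you pick):

```python
SEP = "|"

def composite_row_key(device: str, stat_type: str, instance: str) -> str:
    """Build a 'devicename|stat type|instance' row key as Aaron describes."""
    parts = (device, stat_type, instance)
    for p in parts:
        if SEP in p:
            raise ValueError("key part %r contains the separator %r" % (p, SEP))
    return SEP.join(parts)

def split_row_key(key: str) -> tuple:
    """Recover the three components from a composite row key."""
    device, stat_type, instance = key.split(SEP)
    return device, stat_type, instance
```

The point of the design is that one physical CF can hold every device's every statistic, with the security/grouping dimension pushed into the key instead of into the schema.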
Re: 1000's of column families
I think I misunderstood your all data in one location note. I thought you meant to store it all in one CF. Best regards, Robin Verlangen Software engineer W http://www.robinverlangen.nl E ro...@us2.nl 2012/9/28 Hiller, Dean dean.hil...@nrel.gov I thought someone was saying each column family added to RAM on every node not RAM on a single node. It adds RAM on every node??? So eventually, I will run out? [...]
Re: 1000's of column families
Yeah, you can't scale the number of CFs by adding new nodes to a cluster. You have to create multiple clusters. Anyways, I was thinking about your problem and the solution seems to be to give each team/project their own CF and have them use composite row keys as I wrote about earlier. Yes, that may mean you store the data for the same node multiple times, but that's pretty typical with Cassandra, where you're de-normalizing your data to meet your query needs, and Cassandra does seem to scale that way. Also, if you're not already using compression, you should. My experience with compression and time series data has been pretty amazing, especially with my CFs where I store a day's worth of data in a single column as a vector. That gives you the best of both worlds: you get your per-team security, and I'd assume (ha!) that would dramatically reduce the number of CFs you have to deal with, since it's per-team/project and not per-device. On Fri, Sep 28, 2012 at 12:14 PM, Hiller, Dean dean.hil...@nrel.gov wrote: I thought someone was saying each column family added to RAM on every node not RAM on a single node. [...]
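Aaron's "day's worth of data in a single column as a vector" approach can be sketched like this (Python; the fixed-width little-endian double layout is an assumption for illustration, not what Aaron actually used):

```python
import struct

def pack_day(samples):
    """Pack one day of numeric readings into a single column value."""
    return struct.pack("<%dd" % len(samples), *samples)

def unpack_day(blob):
    """Recover the readings from a packed column value."""
    count = len(blob) // 8  # 8 bytes per IEEE-754 double
    return list(struct.unpack("<%dd" % count, blob))
```

One column per device-day instead of, say, 1440 columns of minute samples eliminates most per-column overhead, and since neighboring readings tend to be similar, the blob compresses well, which is presumably why Aaron sees such good results with compression on time series.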
Re: 1000's of column families
We had some serious trouble with dynamically adding CFs, although last time we tried we were using version 0.7, so maybe that's not an issue any more. Our problems were two: - You are (were?) not supposed to add CFs concurrently. Since we had more servers talking to the same Cassandra cluster, we had to use distributed locks (Hazelcast) to avoid concurrency. - You must be very careful about adding new CFs on different Cassandra nodes. If you do that fast enough, and the clocks of the two servers are skewed, you will severely compromise your schema (Cassandra will not understand in which order the updates must be applied). As I said, this applied to version 0.7; maybe current versions have solved these problems. Flavio On 2012/09/27 16:11, Hiller, Dean wrote: We have 1000's of different building devices and we stream data from these devices. [...]
Re: 1000's of column families
On Thu, Sep 27, 2012 at 12:13 AM, Hiller, Dean dean.hil...@nrel.gov wrote: We are streaming data with 1 stream per 1 CF and we have 1000's of CFs. When using the tools they are all geared to analyzing ONE column family at a time :(. If I remember correctly, Cassandra supports as many CFs as you want, correct? Even though I am going to have tons of fun with limitations on the tools, correct? (I may end up wrapping the node tool with my own aggregate calls if needed to sum up multiple column families and such). Is there a non-rhetorical question in there? Or is that maybe a feature request in disguise? -- Sylvain
Re: 1000's of column families
Every CF adds some overhead (in memory) to each node. This is something you should really keep in mind. Best regards, Robin Verlangen Software engineer W http://www.robinverlangen.nl E ro...@us2.nl 2012/9/27 Sylvain Lebresne sylv...@datastax.com Is there a non rhetorical question in there? Maybe is that a feature request in disguise? [...]
Re: 1000's of column families
Is there a non rhetorical question in there? Maybe is that a feature request in disguise? The question was basically: is Cassandra OK with as many CFs as you want? It sounds like it is not, based on the email saying that every CF causes a bit more RAM to be used. So if Cassandra is not OK with as many CFs as you want, does anyone know what that limit would be for 16G of RAM, or something I could calculate with? Thanks, Dean On 9/27/12 2:37 AM, Sylvain Lebresne sylv...@datastax.com wrote: [...]
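Nobody in the thread gives Dean a hard number, and the real per-CF cost varies by Cassandra version and configuration, but the back-of-envelope calculation he is asking for looks something like this (a hedged sketch; `per_cf_overhead_mb` is a placeholder you would have to measure on your own cluster, not a real figure):

```python
def max_column_families(heap_mb, per_cf_overhead_mb, reserved_fraction=0.5):
    """Rough ceiling on CF count for a given JVM heap size.

    reserved_fraction: heap kept free for memtables, caches, and GC headroom,
    so only the remainder is budgeted for per-CF fixed overhead.
    """
    usable_mb = heap_mb * (1.0 - reserved_fraction)
    return int(usable_mb // per_cf_overhead_mb)
```

For example, with a 16 GB heap, half reserved, and an assumed 1 MB of fixed overhead per CF, the ceiling would be around 8192 CFs, which is at least consistent with Dean's "4000 CFs with no issue yet" observation earlier in the thread.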
Re: 1000's of column families
Out of curiosity, is it really necessary to have that amount of CFs? I am probably still used to relational databases, where you would use a new table just in case you need to store different kinds of data. As Cassandra stores anything in each CF, it might make sense to have a lot of CFs to store your data... But why wouldn't you use a single CF with partitions in this case? Wouldn't it be the same thing? I am asking because I might learn a new modeling technique with the answer. []s 2012/9/26 Hiller, Dean dean.hil...@nrel.gov We are streaming data with 1 stream per 1 CF and we have 1000's of CFs. When using the tools they are all geared to analyzing ONE column family at a time :(. If I remember correctly, Cassandra supports as many CFs as you want, correct? Even though I am going to have tons of fun with limitations on the tools, correct? (I may end up wrapping the node tool with my own aggregate calls if needed to sum up multiple column families and such). Thanks, Dean -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
Re: 1000's of column families
We have 1000's of different building devices and we stream data from these devices. The format and data from each one varies, so one device has temperature at timeX with some other variables, another device has CO2 percentage and other variables. Every device is unique and streams its own data. We dynamically discover devices and register them. Basically, one CF or table per thing really makes sense in this environment. While we could try to find out which devices are similar, this would really be a pain, and some devices add some new variable into the equation. NOT only that, but researchers can register new datasets and upload them as well, and each dataset they have they do NOT want to share with other researchers necessarily, so we have security groups and each CF belongs to security groups. We dynamically create CFs on the fly as people register new datasets. On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions. We could create one CF and put all the data and have a partition per device, but then a time partition will contain multiple devices of data, meaning we need to shrink our time partition size, where if we have a CF per device, the time partition can be larger as it is only for that one device. THEN, on top of that, we have a meta CF for these devices, so some people want to query for streams that match criteria AND which returns a CF name, and they query that CF name, so we almost need a query with variables like select cfName from Meta where x = y and then select * from cfName where x. Which we can do today. Dean From: Marcelo Elias Del Valle mvall...@gmail.com Reply-To: user@cassandra.apache.org Date: Thursday, September 27, 2012 8:01 AM To: user@cassandra.apache.org Subject: Re: 1000's of column families Out of curiosity, is it really necessary to have that amount of CFs? [...]
Re: 1000's of column families
Dean, I was used, in the relational world, to using Hibernate and O/R mapping. There were times when I used 3 classes (2 inheriting from 1 another) and mapped all of them to 1 table. The common part was in the superclass and each subclass had its own columns. The table, however, used to have all the columns, and this design was hard because of that, as creating more subclasses would need changes in the table. However, if you use playOrm, and if playOrm has/had a feature to allow inheritance mapping to a CF, it would solve your problem, wouldn't it? Of course it is probably much harder than it might appear... :D Best regards, Marcelo Valle. 2012/9/27 Hiller, Dean dean.hil...@nrel.gov We have 1000's of different building devices and we stream data from these devices. [...]
Re: 1000's of column families
PlayOrm DOES support inheritance mapping but only supports single table right now. In fact, DboColumnMeta.java has 4 subclasses that all map to that one ColumnFamily, so we already support and heavily use the inheritance feature. That said, I am more concerned with scalability. The more you stuff into a table, the more partitions you need... As an example, I really have a choice. Have this in a partition:
device1 datapoint1
device2 datapoint1
device1 datapoint2
device2 datapoint2
device1 datapoint3
OR have just this in a partition:
device1 datapoint1
device1 datapoint2
device1 datapoint3
If I use the latter approach, I can have more points for device1 in one partition. I could use inheritance, but then I can't fit as many data points for device 1 in a partition. Does that make more sense? Later, Dean From: Marcelo Elias Del Valle mvall...@gmail.com Reply-To: user@cassandra.apache.org Date: Thursday, September 27, 2012 8:45 AM To: user@cassandra.apache.org Subject: Re: 1000's of column families Dean, I was used, in the relational world, to using Hibernate and O/R mapping. [...]
2012/9/27 Hiller, Dean dean.hil...@nrel.gov

We have 1000's of different building devices and we stream data from these devices. The format and data from each one varies, so one device has temperature at timeX with some other variables, another device has CO2 percentage and other variables. Every device is unique and streams its own data. We dynamically discover devices and register them. Basically, one CF or table per thing really makes sense in this environment. While we could try to find out which devices are similar, this would really be a pain, and some devices add some new variable into the equation. NOT only that, but researchers can register new datasets and upload them as well, and each dataset they have they do NOT necessarily want to share with other researchers, so we have security groups and each CF belongs to security groups. We dynamically create CFs on the fly as people register new datasets.

On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions. We could create one CF and put all the data in it and have a partition per device, but then a time partition would contain multiple devices' data, meaning we would need to shrink our time partition size, whereas if we have a CF per device, the time partition can be larger as it is only for that one device.

THEN, on top of that, we have a meta CF for these devices, so some people want to query for streams that match criteria, AND that query returns a CF name, and then they query that CF name. We almost need a query with variables, like "select cfName from Meta where x = y" and then "select * from cfName where x" -- which we can do today.
Dean

From: Marcelo Elias Del Valle mvall...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Thursday, September 27, 2012 8:01 AM
To: user@cassandra.apache.org
Subject: Re: 1000's of column families

Out of curiosity, is it really necessary to have that amount of CFs? I am probably still used to relational databases, where you would use a new table just in case you need to store different kinds of data. As Cassandra stores anything in each CF, it might make sense to have a lot of CFs to store your data... But why wouldn't you use a single CF with partitions in this case? Wouldn't it be the same thing? I am asking because I might learn a new modeling technique with the answer. []s

2012/9/26 Hiller, Dean dean.hil...@nrel.gov
We are streaming data with 1 stream per 1 CF and we have 1000's of CFs.
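The trade-off Dean describes above -- the fewer devices you mix into a partition, the longer the time window each partition can cover -- can be illustrated with a toy calculation. The budget and ingest-rate numbers below are made up purely for illustration:

```python
MAX_POINTS_PER_PARTITION = 1_000_000  # hypothetical per-partition row budget
POINTS_PER_DEVICE_PER_DAY = 10_000    # hypothetical ingest rate per device

def partition_window_days(devices_per_partition):
    """Days of data one partition can hold before hitting the budget."""
    points_per_day = devices_per_partition * POINTS_PER_DEVICE_PER_DAY
    return MAX_POINTS_PER_PARTITION // points_per_day

# Mixing 10 devices into one partition vs. one device per partition:
mixed = partition_window_days(10)   # partition fills in 10 days
single = partition_window_days(1)   # partition lasts 100 days
```

With one device per partition (or per CF), each time partition spans 10x longer here, which is exactly why Dean prefers not to interleave devices.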
Re: 1000's of column families
Hector also offers support for 'Virtual Keyspaces' which you might want to look at.

On Thu, Sep 27, 2012 at 1:10 PM, Aaron Turner synfina...@gmail.com wrote:

On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean dean.hil...@nrel.gov wrote:
We have 1000's of different building devices and we stream data from these devices. The format and data from each one varies, so one device has temperature at timeX with some other variables, another device has CO2 percentage and other variables. Every device is unique and streams its own data. We dynamically discover devices and register them. Basically, one CF or table per thing really makes sense in this environment. While we could try to find out which devices are similar, this would really be a pain, and some devices add some new variable into the equation. NOT only that, but researchers can register new datasets and upload them as well, and each dataset they have they do NOT necessarily want to share with other researchers, so we have security groups and each CF belongs to security groups. We dynamically create CFs on the fly as people register new datasets. On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions. We could create one CF and put all the data in it and have a partition per device, but then a time partition would contain multiple devices' data, meaning we would need to shrink our time partition size, whereas if we have a CF per device, the time partition can be larger as it is only for that one device. THEN, on top of that, we have a meta CF for these devices, so some people want to query for streams that match criteria, AND that query returns a CF name, and then they query that CF name. We almost need a query with variables, like "select cfName from Meta where x = y" and then "select * from cfName where x" -- which we can do today.

How strict are your security requirements? If it wasn't for that, you'd be much better off storing data on a per-statistic basis than per-device.
Hell, you could store everything in a single CF by using a composite row key: devicename|stat type|instance. But yeah, there isn't a hard limit for the number of CFs, but there is overhead associated with each one, so I wouldn't consider your design scalable. Generally speaking, hundreds are OK, but thousands is pushing it.

--
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. -- Benjamin Franklin
"carpe diem quam minimum credula postero"
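Aaron's "devicename|stat type|instance" composite row key is just string concatenation with a reserved separator. A minimal sketch of building and splitting such keys (the separator choice and the guard against it appearing in the parts are my own assumptions, not anything the thread specifies):

```python
SEP = "|"  # assumed separator; must never occur inside a key component

def make_row_key(device, stat_type, instance):
    # Build a composite row key in the devicename|stat type|instance style.
    for part in (device, stat_type, instance):
        if SEP in part:
            raise ValueError(f"{SEP!r} is reserved as the key separator")
    return SEP.join((device, stat_type, instance))

def split_row_key(key):
    # Recover the three components from a composite row key.
    device, stat_type, instance = key.split(SEP)
    return device, stat_type, instance

key = make_row_key("hvac-unit-3", "temperature", "sensor0")
```

Because the client can rebuild the key from names it already knows, it can read the row directly with no index lookup -- the property Aaron leans on later in the thread.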
Re: 1000's of column families
On Thu, Sep 27, 2012 at 7:35 PM, Marcelo Elias Del Valle mvall...@gmail.com wrote:

2012/9/27 Aaron Turner synfina...@gmail.com
How strict are your security requirements? If it wasn't for that, you'd be much better off storing data on a per-statistic basis than per-device. Hell, you could store everything in a single CF by using a composite row key: devicename|stat type|instance. But yeah, there isn't a hard limit for the number of CFs, but there is overhead associated with each one, so I wouldn't consider your design scalable. Generally speaking, hundreds are OK, but thousands is pushing it.

Aaron, imagine that instead of using a composite key in this case, you use a simple row key, instance_uuid. Then, to index data by devicename|stat type|instance, you use another CF with this composite key, or several CFs to index it. Do you see any drawbacks in terms of performance?

Really that depends on the client side, I think. Ideally, you'd like the client to be able to directly access the row by name without looking it up in some index. Basically, if you have to look up the instance_uuid, that's another call to some datastore, which takes more time than generating the row key via its composites. At least that's my opinion...

Of course there are times where using an instance_uuid makes a lot of sense... like if you rename a device and want all your stats to move to the new name. Much easier to just update the mapping record than to read and rewrite all your rows for that device. In my project, we use a device_uuid (just a primary key stored in an Oracle DB... long story!), but everything else is by name in our composite row keys.

--
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
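Aaron's point about the extra round trip can be made concrete with a toy stand-in for the datastore that simply counts reads. The CF names, keys, and values below are all hypothetical; the only thing the sketch shows is that the index route costs two calls where the composite-key route costs one:

```python
calls = []  # records every datastore round trip

store = {
    "index_cf": {"deviceA|temp|s0": "uuid-123"},     # composite -> uuid index
    "data_by_uuid": {"uuid-123": {"value": 71.2}},   # rows keyed by uuid
    "data_by_name": {"deviceA|temp|s0": {"value": 71.2}},  # rows keyed by name
}

def fetch(cf, key):
    # Stand-in for one read against the datastore.
    calls.append((cf, key))
    return store[cf][key]

# Via an index CF: two round trips (index lookup, then the data read).
uuid = fetch("index_cf", "deviceA|temp|s0")
row_indexed = fetch("data_by_uuid", uuid)
indexed_calls = len(calls)

# Via a composite row key the client builds itself: one round trip.
calls.clear()
row_direct = fetch("data_by_name", "deviceA|temp|s0")
```

As Aaron notes, the uuid indirection still earns its keep when devices get renamed: you update one mapping row instead of rewriting every data row.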
Re: 1000's of column families
Unfortunately, the security aspect is very strict. Some make their data public, but there are many projects where, due to client contracts, they cannot make their data public within our company (i.e. other groups in our company are not allowed to see the data). Also, currently, we have researchers upload their own datasets as well. Ideally, they want to see this one noSQL store as the place where all data for the company lives... ALL of it, so if you add up all the applications, which would be huge, and then all the tables, which is large, it just keeps growing. It is a very nice concept (all data in one location), though we will see how implementing it goes.

How much overhead per column family in RAM? So far we have around 4000 CFs with no issue that I see yet.

Dean

On 9/27/12 11:10 AM, Aaron Turner synfina...@gmail.com wrote:

On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean dean.hil...@nrel.gov wrote:
We have 1000's of different building devices and we stream data from these devices. The format and data from each one varies, so one device has temperature at timeX with some other variables, another device has CO2 percentage and other variables. Every device is unique and streams its own data. We dynamically discover devices and register them. Basically, one CF or table per thing really makes sense in this environment. While we could try to find out which devices are similar, this would really be a pain, and some devices add some new variable into the equation. NOT only that, but researchers can register new datasets and upload them as well, and each dataset they have they do NOT necessarily want to share with other researchers, so we have security groups and each CF belongs to security groups. We dynamically create CFs on the fly as people register new datasets. On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions.
We could create one CF and put all the data in it and have a partition per device, but then a time partition would contain multiple devices' data, meaning we would need to shrink our time partition size, whereas if we have a CF per device, the time partition can be larger as it is only for that one device. THEN, on top of that, we have a meta CF for these devices, so some people want to query for streams that match criteria, AND that query returns a CF name, and then they query that CF name. We almost need a query with variables, like "select cfName from Meta where x = y" and then "select * from cfName where x" -- which we can do today.

How strict are your security requirements? If it wasn't for that, you'd be much better off storing data on a per-statistic basis than per-device. Hell, you could store everything in a single CF by using a composite row key: devicename|stat type|instance. But yeah, there isn't a hard limit for the number of CFs, but there is overhead associated with each one, so I wouldn't consider your design scalable. Generally speaking, hundreds are OK, but thousands is pushing it.
Re: 1000's of column families
so if you add up all the applications which would be huge and then all the tables which is large, it just keeps growing. It is a very nice concept (all data in one location), though we will see how implementing it goes.

This shouldn't be a real problem for Cassandra. Just add more nodes, and every node contains a smaller piece of the cake (~ring).

Best regards,
Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E ro...@us2.nl
http://goo.gl/Lt7BC

2012/9/27 Hiller, Dean dean.hil...@nrel.gov
Unfortunately, the security aspect is very strict. Some make their data public, but there are many projects where, due to client contracts, they cannot make their data public within our company (i.e. other groups in our company are not allowed to see the data). Also, currently, we have researchers upload their own datasets as well. Ideally, they want to see this one noSQL store as the place where all data for the company lives... ALL of it, so if you add up all the applications, which would be huge, and then all the tables, which is large, it just keeps growing. It is a very nice concept (all data in one location), though we will see how implementing it goes.

How much overhead per column family in RAM? So far we have around 4000 CFs with no issue that I see yet.

Dean

On 9/27/12 11:10 AM, Aaron Turner synfina...@gmail.com wrote:
On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean dean.hil...@nrel.gov wrote:
We have 1000's of different building devices and we stream data from these devices.
The format and data from each one varies, so one device has temperature at timeX with some other variables, another device has CO2 percentage and other variables. Every device is unique and streams its own data. We dynamically discover devices and register them. Basically, one CF or table per thing really makes sense in this environment. While we could try to find out which devices are similar, this would really be a pain, and some devices add some new variable into the equation. NOT only that, but researchers can register new datasets and upload them as well, and each dataset they have they do NOT necessarily want to share with other researchers, so we have security groups and each CF belongs to security groups. We dynamically create CFs on the fly as people register new datasets. On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions. We could create one CF and put all the data in it and have a partition per device, but then a time partition would contain multiple devices' data, meaning we would need to shrink our time partition size, whereas if we have a CF per device, the time partition can be larger as it is only for that one device. THEN, on top of that, we have a meta CF for these devices, so some people want to query for streams that match criteria, AND that query returns a CF name, and then they query that CF name. We almost need a query with variables, like "select cfName from Meta where x = y" and then "select * from cfName where x" -- which we can do today.

How strict are your security requirements? If it wasn't for that, you'd be much better off storing data on a per-statistic basis than per-device. Hell, you could store everything in a single CF by using a composite row key: devicename|stat type|instance. But yeah, there isn't a hard limit for the number of CFs, but there is overhead associated with each one, so I wouldn't consider your design scalable. Generally speaking, hundreds are OK, but thousands is pushing it.
1000's of column families
We are streaming data with 1 stream per 1 CF and we have 1000's of CFs. When using the tools, they are all geared to analyzing ONE column family at a time :(. If I remember correctly, Cassandra supports as many CFs as you want, correct? Even though I am going to have tons of fun with limitations on the tools, correct? (I may end up wrapping the node tool with my own aggregate calls if needed to sum up multiple column families and such.) Thanks, Dean
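The "wrap the node tool with my own aggregate calls" idea boils down to summing one metric across many per-CF stat blocks. A minimal sketch, assuming you have already parsed per-CF output (the CF names, metric names, and numbers below are made up; real `nodetool cfstats` output would need its own parsing step):

```python
# Hypothetical parsed per-CF stats, one dict per column family.
cf_stats = {
    "device_cf_1401": {"live_sstables": 4, "space_used_bytes": 1_200_000},
    "device_cf_1402": {"live_sstables": 2, "space_used_bytes": 800_000},
    "device_cf_1403": {"live_sstables": 3, "space_used_bytes": 500_000},
}

def aggregate(stats, metric):
    """Sum one metric across all column families."""
    return sum(per_cf[metric] for per_cf in stats.values())

total_space = aggregate(cf_stats, "space_used_bytes")
total_sstables = aggregate(cf_stats, "live_sstables")
```

With thousands of CFs, a wrapper like this is the only practical way to get a whole-cluster view out of tools that report one CF at a time.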