Re: 1000's of column families
Ben, to address your question, read my last post, but to summarize: yes, there is less memory overhead in prefixing keys than in managing multiple CFs, EXCEPT when doing map/reduce. Doing map/reduce, you will now have HUGE overhead reading a whole slew of rows you don't care about, as you can't map/reduce a single virtual CF but must map/reduce the whole CF, wasting TONS of resources.

Thanks,
Dean

On 10/1/12 3:38 PM, Ben Hood <0x6e6...@gmail.com> wrote:
On Mon, Oct 1, 2012 at 9:38 PM, Brian O'Neill <b...@alumni.brown.edu> wrote:
It's just a convenient way of prefixing:
http://hector-client.github.com/hector/build/html/content/virtual_keyspaces.html
So given that it is possible to use a CF per tenant, should we assume that at sufficient scale there is less overhead to prefix keys than there is to manage multiple CFs?
Ben
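For readers following the thread: the key prefixing being discussed just means composing the physical row key from a tenant prefix plus the logical key, which is conceptually what Hector's virtual keyspaces do. A minimal sketch (the separator and helper names are illustrative, not Hector's actual API):

```java
// Sketch of virtual-CF key prefixing: many tenants share one physical CF
// by composing row keys as "<tenant>:<key>". Names here are hypothetical.
public class VirtualCfKeys {
    static final char SEP = ':';

    // Build the physical row key for a tenant's logical key.
    static String toPhysical(String tenant, String logicalKey) {
        return tenant + SEP + logicalKey;
    }

    // Recover {tenant, logicalKey} from a physical row key.
    static String[] fromPhysical(String physicalKey) {
        int i = physicalKey.indexOf(SEP);
        return new String[] { physicalKey.substring(0, i), physicalKey.substring(i + 1) };
    }

    public static void main(String[] args) {
        String k = toPhysical("tenantA", "sensor-42");
        System.out.println(k); // tenantA:sensor-42
        String[] parts = fromPhysical(k);
        System.out.println(parts[0] + " / " + parts[1]); // tenantA / sensor-42
    }
}
```

The memory point in the thread follows from this: the prefix costs a few bytes per row, whereas each extra CF carries its own memtable and metadata on every node.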
Re: 1000's of column families
Without putting too much thought into it... given the underlying architecture, I think you could/would have to write your own partitioner, which would partition based on the prefix/virtual keyspace.

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive, King of Prussia, PA 19406
M: 215.588.6024
@boneill42 http://www.twitter.com/boneill42
healthmarketscience.com

The information transmitted in this email message is for the intended recipient only and may contain confidential and/or privileged material. If you received this email in error and are not the intended recipient, or the person responsible for delivering it to the intended recipient, please contact the sender at the email above, delete this email and any attachments, and destroy any copies thereof. Any review, retransmission, dissemination, copying or other use of, or taking any action in reliance upon, this information by persons or entities other than the intended recipient is strictly prohibited.

On 10/2/12 9:00 AM, Ben Hood <0x6e6...@gmail.com> wrote:
Dean,
On Tue, Oct 2, 2012 at 1:37 PM, Hiller, Dean <dean.hil...@nrel.gov> wrote:
Ben, to address your question, read my last post but to summarize: yes, there is less memory overhead in prefixing keys than in managing multiple CFs, EXCEPT when doing map/reduce, where you can't map/reduce a single virtual CF but must map/reduce the whole CF, wasting TONS of resources.
That's a good point that I hadn't considered beforehand, especially as I'd like to run MR jobs against these CFs. Is this limitation inherent in the way that Cassandra is modelled as input for Hadoop, or could you write a custom slice query to feed only one particular prefix into Hadoop?
Cheers,
Ben
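The prefix-based partitioner Brian suggests can be sketched in toy form: derive the token from only the tenant prefix of the key. This is an illustration of the idea, not Cassandra's `IPartitioner` API, and CRC32 stands in for the real token hash. Note the consequence discussed later in the thread: every row of a virtual CF maps to the same token, hence the same replicas.

```java
// Toy prefix partitioner: token = hash(tenant prefix only), so an entire
// virtual CF lands on one replica set. Not Cassandra's IPartitioner API.
import java.util.zip.CRC32;
import java.nio.charset.StandardCharsets;

public class PrefixPartitioner {
    // Derive a token from only the tenant prefix of "tenant:key".
    static long token(String rowKey) {
        String prefix = rowKey.substring(0, rowKey.indexOf(':'));
        CRC32 crc = new CRC32();
        crc.update(prefix.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public static void main(String[] args) {
        // All of tenantA's rows get the same token -> same node(s).
        System.out.println(token("tenantA:row1") == token("tenantA:row2")); // true
    }
}
```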
Re: 1000's of column families
Thanks for the idea (but please keep thinking on it)... that is 100% what we don't want, since the partitioned data then resides on the same node. I want to map/reduce the column families and leverage the parallel disks :( :( I am sure others would want to do the same. We almost need a feature of virtual column families, and "column family" should really not be called column family but ReplicationGroup or something, where replication is configured for all CFs in that group.

ANYONE have any other ideas???

Dean

On 10/2/12 7:20 AM, Brian O'Neill <boneil...@gmail.com> wrote:
Without putting too much thought into it... Given the underlying architecture, I think you could/would have to write your own partitioner, which would partition based on the prefix/virtual keyspace.
-brian
Re: 1000's of column families
Agreed. Do we know yet what the overhead is for each column family? What is the limit? If you have a SINGLE keyspace w/ 2+ CFs, what happens? Anyone know?

-brian

On 10/2/12 9:28 AM, Hiller, Dean <dean.hil...@nrel.gov> wrote:
Thanks for the idea (but please keep thinking on it)... that is 100% what we don't want, since the partitioned data then resides on the same node. I want to map/reduce the column families and leverage the parallel disks. We almost need a feature of virtual column families, where replication is configured for all CFs in that group. ANYONE have any other ideas???
Dean
Re: 1000's of column families
Brian,

On Tue, Oct 2, 2012 at 2:20 PM, Brian O'Neill <boneil...@gmail.com> wrote:
Without putting too much thought into it... Given the underlying architecture, I think you could/would have to write your own partitioner, which would partition based on the prefix/virtual keyspace.

I might be barking up the wrong tree here, but looking at the source of ColumnFamilyInputFormat, it seems that you can specify a KeyRange for the input, but only when you use an order-preserving partitioner. So I presume that if you are using the RandomPartitioner, you are effectively doing a full CF scan (i.e. including all tenants in your system).

Ben
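Ben's observation can be illustrated: under a randomizing partitioner, rows are placed on the ring in token (hash) order, which bears no relation to key order, so a contiguous key range like "all of tenantA's keys" has no contiguous span on the ring to hand a KeyRange. A toy sketch (CRC32 stands in for RandomPartitioner's MD5 token):

```java
// Why KeyRange scans need an order-preserving partitioner: sorting by the
// token hash scrambles lexical key order, scattering a tenant's rows.
import java.util.*;
import java.util.zip.CRC32;

public class HashOrder {
    // Stand-in for the partitioner's token function (RandomPartitioner uses MD5).
    static long token(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(java.nio.charset.StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // The keys in ring order, i.e. sorted by token rather than lexically.
    static List<String> ringOrder(List<String> keys) {
        List<String> r = new ArrayList<>(keys);
        r.sort(Comparator.comparingLong(HashOrder::token));
        return r;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
            "tenantA:1", "tenantA:2", "tenantA:3", "tenantB:1", "tenantB:2");
        // Same keys, but ordered by hash; tenantA's rows are in general
        // interleaved with everyone else's, so no KeyRange covers just them.
        System.out.println(ringOrder(keys));
    }
}
```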
Re: 1000's of column families
Exactly.

-brian

On 10/2/12 9:55 AM, Ben Hood <0x6e6...@gmail.com> wrote:
Looking at the source of ColumnFamilyInputFormat, it seems that you can specify a KeyRange for the input, but only when you use an order-preserving partitioner. So I presume that if you are using the RandomPartitioner, you are effectively doing a full CF scan (i.e. including all tenants in your system).
Ben
Re: 1000's of column families
On Tue, Oct 2, 2012 at 3:37 PM, Brian O'Neill <boneil...@gmail.com> wrote:
Exactly.

So you're back to the deliberation between using multiple CFs (potentially with some known working upper bound*) or feeding your map/reduce in some other way (as you decided to do with Storm). In my particular scenario I'd like to be able to do a combination of batch processing on top of less frequently changing data (which is why I was looking at Hadoop) and some real-time analytics.

Cheers,
Ben

(*) Not sure whether this applies to an individual keyspace or an entire cluster.
Re: 1000's of column families
Ben, Brian,

By the way, PlayOrm offers a NoSqlTypedSession that is different from the ORM half of PlayOrm: it deals in raw data and does indexing, so you can run scalable SQL on data that has no ORM on top of it. That is what we use for our 1000's of CFs, as we don't know the format of any of those tables ahead of time (in our world, users tell us the format and wire in streams through an API we expose, AND they tell PlayOrm which columns to index). That layer deals with BigInteger, BigDecimal, String, and I think byte[].

So I am going to add virtual CFs to PlayOrm in the coming week. We are going to feed in streams, partition the virtual CFs (which sit in a single real CF) using PlayOrm partitioning, and then query into each partition. The only real issue is knowing which partitions exist, and that is left to the client to keep track of; but if your app knows all the partitions (and those could be saved to some rows in the NoSQL store), that works. I will probably try out Storm after that.

Later,
Dean

On 10/2/12 9:09 AM, Ben Hood <0x6e6...@gmail.com> wrote:
So you're back to the deliberation between using multiple CFs (potentially with some known working upper bound) or feeding your map/reduce in some other way (as you decided to do with Storm)…
Re: 1000's of column families
Another option that may or may not work for you is the support in Cassandra 1.1+ for using a secondary index as an input to your map/reduce job. What you might do is add a field to the column family that represents which virtual column family the row is part of. Then, when doing map/reduce jobs, you could use that field as the secondary index limiter. Secondary index map/reduce is not as efficient, since you first get all of the keys and then do multigets to fetch the data that you need for the job. However, it's another option for not scanning the whole column family.

On Oct 2, 2012, at 10:09 AM, Ben Hood <0x6e6...@gmail.com> wrote:
So you're back to the deliberation between using multiple CFs (potentially with some known working upper bound) or feeding your map/reduce in some other way (as you decided to do with Storm)…
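Jeremy's two-step pattern (index lookup for the matching keys, then multigets for the rows) can be sketched with in-memory maps standing in for the physical CF; the "vcf" marker column and all names here are illustrative, and the real index scan happens inside Cassandra:

```java
// Sketch of secondary-index-driven access: step 1 finds row keys whose
// "vcf" column matches the virtual CF; step 2 multigets just those rows.
import java.util.*;

public class IndexThenMultiget {
    // Toy physical CF: row key -> columns; "vcf" marks the virtual CF.
    static Map<String, Map<String, String>> sampleCf() {
        Map<String, Map<String, String>> cf = new HashMap<>();
        cf.put("r1", Map.of("vcf", "tenantA", "temp", "21.5"));
        cf.put("r2", Map.of("vcf", "tenantB", "co2", "410"));
        cf.put("r3", Map.of("vcf", "tenantA", "temp", "19.0"));
        return cf;
    }

    // Step 1: the index lookup -- all row keys whose "vcf" column matches.
    static Set<String> matchingKeys(Map<String, Map<String, String>> cf, String vcf) {
        Set<String> keys = new HashSet<>();
        for (Map.Entry<String, Map<String, String>> e : cf.entrySet())
            if (vcf.equals(e.getValue().get("vcf"))) keys.add(e.getKey());
        return keys;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> cf = sampleCf();
        Set<String> keys = matchingKeys(cf, "tenantA"); // index scan
        // Step 2: multiget only those rows.
        Map<String, Map<String, String>> rows = new HashMap<>();
        for (String k : keys) rows.put(k, cf.get(k));
        System.out.println(rows.keySet()); // just tenantA's rows
    }
}
```

The inefficiency Jeremy mentions is visible in the shape of the code: the row data is fetched key by key rather than as one contiguous scan.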
Re: 1000's of column families
Jeremy,

On Tuesday, October 2, 2012 at 17:06, Jeremy Hanna wrote:
Secondary index mapreduce is not as efficient since you first get all of the keys and then do multigets to get the data that you need for the mapreduce job. However, it's another option for not scanning the whole column family.

Interesting. This is probably a stupid question, but why shouldn't you be able to use the secondary index to go straight to the slices that belong to the attribute you are searching by? Is this something to do with the way Cassandra is exposed as an InputFormat for Hadoop, or is this a general property of searching by secondary index?

Ben
Re: 1000's of column families
Because the data for an index is not all together (i.e. you need a multiget to get the data); it is not contiguous. With a prefix, the data is kept together in a partition, so all data for a prefix, from what I understand, is contiguous.

QUESTION: What I don't get in the comment is this: I assume you are referring to CQL, in which case we would need to specify the partition (in addition to the index), which means all that data is on one node, correct? Or did I miss something there?

Thanks,
Dean

From: Ben Hood <0x6e6...@gmail.com>
Date: Tuesday, October 2, 2012 11:18 AM
To: user@cassandra.apache.org
Subject: Re: 1000's of column families

Interesting. This is probably a stupid question, but why shouldn't you be able to use the secondary index to go straight to the slices that belong to the attribute you are searching by? Is this something to do with the way Cassandra is exposed as an InputFormat for Hadoop, or is this a general property of searching by secondary index?

Ben
Re: 1000's of column families
Dean,

On Tuesday, October 2, 2012 at 18:52, Hiller, Dean wrote:
Because the data for an index is not all together (i.e. you need a multiget to get the data). It is not contiguous. With a prefix, the data is kept together in a partition, so all data for a prefix is contiguous.

So you're saying that you can access the primary index with a key range, but to access the secondary index you first need to get all the keys and follow up with a multiget, which would use the secondary index to speed up the lookup of the matching rows?

QUESTION: What I don't get in the comment is I assume you are referring to CQL, in which case we would need to specify the partition (in addition to the index), which means all that data is on one node, correct? Or did I miss something there?

Maybe my question was just silly - I wasn't referring to CQL. As for the locality of the data, I was hoping to be able to fire off an MR job to process all matching rows in the CF - I was assuming that this job would get executed on the same node as the data. But I think the real confusion in my question has to do with the way ColumnFamilyInputFormat has been implemented, since it would appear that it ingests the entire (non-OPP) CF into Hadoop, such that the predicate needs to be applied in the job rather than up front in the Cassandra query.

Cheers,
Ben
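If the whole (non-OPP) CF really is fed into Hadoop, as Ben suspects, then the virtual-CF predicate ends up as a per-row filter inside the map function. A toy sketch of that filtering step (Hadoop's Mapper types are omitted; all names are illustrative):

```java
// Mapper-side tenant filter: every row of the physical CF is read, and
// rows belonging to other tenants are simply dropped -- the waste the
// thread is discussing.
import java.util.ArrayList;
import java.util.List;

public class TenantFilterMapper {
    // Does this physical row belong to the virtual CF we care about?
    static boolean belongsTo(String rowKey, String tenant) {
        return rowKey.startsWith(tenant + ":");
    }

    // Stand-in for map(key, columns, context): filter, then emit.
    static void map(String rowKey, String tenant, List<String> emitted) {
        if (!belongsTo(rowKey, tenant)) return; // read, then thrown away
        emitted.add(rowKey); // ...real output would be emitted here
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        for (String k : new String[] {"tenantA:r1", "tenantB:r1", "tenantA:r2"})
            map(k, "tenantA", out);
        System.out.println(out); // [tenantA:r1, tenantA:r2]
    }
}
```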
Re: 1000's of column families
On Tuesday, October 2, 2012, Ben Hood <0x6e6...@gmail.com> wrote:
So you're saying that you can access the primary index with a key range, but to access the secondary index, you first need to get all keys and follow up with a multiget, which would use the secondary index to speed the lookup of the matching rows?

Yes, that is how I believe it works. I am by no means an expert.

I also wanted to fire off an MR job to process matching rows in the virtual CF, ideally running on the nodes where it reads the data in. In 0.7, I thought the M/R jobs did not run locally with the data like Hadoop does??? Anyone know if that is still true, or does it run locally to the data now?

Thanks,
Dean
Re: 1000's of column families
It's always had data locality (since Hadoop support was added in 0.6). You don't need to specify a partition; you specify the input predicate with ConfigHelper or the cassandra.input.predicate property.

On Oct 2, 2012, at 2:26 PM, Hiller, Dean <dean.hil...@nrel.gov> wrote:
In 0.7, I thought the M/R jobs did not run locally with the data like Hadoop does??? Anyone know if that is still true, or does it run locally to the data now?
Thanks,
Dean
Re: 1000's of column families
Dean,

We have the same question... We have thousands of separate feeds of data as well (20,000+). To date, we've been using a CF-per-feed strategy, but as we scale this thing out to accommodate all of those feeds, we're trying to figure out if we're going to blow out the memory.

The initial documentation for heap sizing had column families in the equation:
http://www.datastax.com/docs/0.7/operations/tuning#heap-sizing
But in the more recent documentation, it looks like they removed the column family variable with the introduction of the universal key_cache_size:
http://www.datastax.com/docs/1.0/operations/tuning#tuning-java-heap-size

We haven't committed either way yet, but given Ed Anuff's presentation on virtual keyspaces, we were leaning towards a single column family approach:
http://blog.apigee.com/detail/building_a_mobile_data_platform_with_cassandra_-_apigee_under_the_hood/

Definitely let us know what you decide.

-brian

On Fri, Sep 28, 2012 at 11:48 AM, Flavio Baronti <f.baro...@list-group.com> wrote:
We had some serious trouble with dynamically adding CFs, although last time we tried we were using version 0.7, so maybe that's not an issue any more. Our problems were two:
- You are (were?) not supposed to add CFs concurrently. Since we had more servers talking to the same Cassandra cluster, we had to use distributed locks (Hazelcast) to avoid concurrency.
- You must be very careful when adding new CFs through different Cassandra nodes. If you do that fast enough, and the clocks of the two servers are skewed, you will severely compromise your schema (Cassandra will not understand in which order the updates must be applied).
As I said, this applied to version 0.7; maybe current versions have solved these problems.
Flavio

On 2012/09/27 16:11, Hiller, Dean wrote:
We have 1000's of different building devices and we stream data from these devices. The format and data from each one varies, so one device has temperature at timeX with some other variables, and another device has CO2 percentage and other variables. Every device is unique and streams its own data. We dynamically discover devices and register them. Basically, one CF or table per thing really makes sense in this environment. While we could try to find out which devices are similar, this would really be a pain, and some devices add some new variable into the equation. NOT only that, but researchers can register new datasets and upload them as well, and each dataset they have they do NOT necessarily want to share with other researchers, so we have security groups and each CF belongs to security groups. We dynamically create CFs on the fly as people register new datasets.

On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions. We could create one CF, put in all the data, and have a partition per device, but then a time partition would contain multiple devices' data, meaning we would need to shrink our time partition size; whereas if we have a CF per device, the time partition can be larger, as it is only for that one device.

THEN, on top of that, we have a meta CF for these devices, so some people want to query for streams that match criteria AND get back a CF name, and then they query that CF name. So we almost need a query with variables, like "select cfName from Meta where x = y" and then "select * from cfName where x". Which we can do today.

Dean

From: Marcelo Elias Del Valle <mvall...@gmail.com>
Date: Thursday, September 27, 2012 8:01 AM
To: user@cassandra.apache.org
Subject: Re: 1000's of column families

Out of curiosity, is it really necessary to have that amount of CFs? I am probably still used to relational databases, where you would use a new table just in case you need to store different kinds of data. As Cassandra stores anything in each CF, it might make sense to have a lot of CFs to store your data... But why wouldn't you use a single CF with partitions in this case? Wouldn't it be the same thing? I am asking because I might learn a new modeling technique with the answer. []s

2012/9/26 Hiller, Dean <dean.hil...@nrel.gov>
We are streaming data with 1 stream per 1 CF and we have 1000's of CFs. When using the tools, they are all geared to analyzing ONE column family at a time :(. If I remember correctly, Cassandra supports as many CFs as you want, correct? Even though I am going to have tons of fun with limitations on the tools, correct? (I may end up wrapping nodetool with my own aggregate calls if needed to sum up multiple column families and such.)
Thanks,
Dean

--
Marcelo Elias Del Valle
http://mvalle.com
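The meta-CF indirection Dean describes ("select cfName from Meta where x = y", then "select * from cfName") is just a two-step lookup. A toy sketch with maps standing in for the two CFs; every name here is hypothetical:

```java
// Two-step meta lookup: the Meta "CF" maps a criterion to the name of the
// data CF, which is then queried by that name.
import java.util.List;
import java.util.Map;

public class MetaLookup {
    // Meta CF: criterion -> name of the CF holding the matching stream.
    static final Map<String, String> META = Map.of("site=denver", "device_17");
    // One "CF" per device, keyed by CF name (toy stand-in for the store).
    static final Map<String, List<String>> DATA =
        Map.of("device_17", List.of("t0:21.5", "t1:21.7"));

    // "select cfName from Meta where <criterion>" then "select * from <cfName>".
    static List<String> query(String criterion) {
        String cfName = META.get(criterion);
        if (cfName == null) return List.of(); // no stream matches the criterion
        return DATA.getOrDefault(cfName, List.of());
    }

    public static void main(String[] args) {
        System.out.println(query("site=denver")); // [t0:21.5, t1:21.7]
    }
}
```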
Re: 1000's of column families
Well, I am now thinking of adding a virtual capability to PlayOrm, which we currently use, to allow grouping entities into one column family. Right now the CF creation comes from a single entity, so this may change for those entities that declare they are in a single CF group… This should not be a very hard change if we decide to do it. It makes us rely even more on PlayOrm's command line tool (instead of cassandra-cli), as I can't stand reading hex all the time, nor do I like switching my assume validator to utf8, to decimal, to integer just so I can read stuff.

Later,
Dean

On 10/1/12 9:22 AM, Brian O'Neill <b...@alumni.brown.edu> wrote:
Dean,
We have the same question... We have thousands of separate feeds of data as well (20,000+). To date, we've been using a CF-per-feed strategy, but as we scale this thing out to accommodate all of those feeds, we're trying to figure out if we're going to blow out the memory…
Definitely let us know what you decide.
-brian
Re: 1000's of column families
Brian, On Mon, Oct 1, 2012 at 4:22 PM, Brian O'Neill b...@alumni.brown.edu wrote: We haven't committed either way yet, but given Ed Anuff's presentation on virtual keyspaces, we were leaning towards a single column family approach: http://blog.apigee.com/detail/building_a_mobile_data_platform_with_cassandra_-_apigee_under_the_hood/? Is this doing something special or is this just a convenient way of prefixing keys to make the storage space multi-tenanted? Cheers, Ben
Re: 1000's of column families
It's just a convenient way of prefixing: http://hector-client.github.com/hector/build/html/content/virtual_keyspaces.html -brian -- Brian O'Neill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile: 215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
Re: 1000's of column families
On Mon, Oct 1, 2012 at 9:38 PM, Brian O'Neill b...@alumni.brown.edu wrote: It's just a convenient way of prefixing: http://hector-client.github.com/hector/build/html/content/virtual_keyspaces.html So given that it is possible to use a CF per tenant, should we assume that, at sufficient scale, there is less overhead in prefixing keys than in managing multiple CFs? Ben
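For anyone skimming the thread: the "virtual keyspace" trick Brian links to boils down to key manipulation. A minimal sketch of the idea in Python (illustrative only; Hector does this transparently in Java, and the separator character here is an assumption):

```python
def to_virtual_key(tenant: str, row_key: str, sep: str = ":") -> str:
    """Prefix a row key with its tenant to simulate a per-tenant CF."""
    if sep in tenant:
        raise ValueError("tenant name must not contain the separator")
    return tenant + sep + row_key

def from_virtual_key(virtual_key: str, sep: str = ":") -> tuple:
    """Split a prefixed key back into (tenant, original row key)."""
    tenant, _, row_key = virtual_key.partition(sep)
    return tenant, row_key
```

Note that with the RandomPartitioner the whole prefixed key gets hashed, so one tenant's rows are scattered across the ring like everyone else's; that is exactly why a map/reduce job cannot read just one "virtual CF" without scanning the physical CF.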
Re: 1000's of column families
I thought someone was saying each column family added to RAM on every node, not RAM on a single node. It adds RAM on every node??? So eventually, I will run out? Was that person wrong? This would mean adding nodes does not help if he is right. Can anyone confirm this? Thanks, Dean From: Robin Verlangen ro...@us2.nl Reply-To: user@cassandra.apache.org Date: Thursday, September 27, 2012 11:52 PM To: user@cassandra.apache.org Subject: Re: 1000's of column families so if you add up all the applications which would be huge and then all the tables which is large, it just keeps growing. It is a very nice concept (all data in one location), though we will see how implementing it goes. This shouldn't be a real problem for Cassandra. Just add more nodes and every node contains a smaller piece of the cake (~ring). Best regards, Robin Verlangen Software engineer W http://www.robinverlangen.nl E ro...@us2.nl http://goo.gl/Lt7BC Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies. 2012/9/27 Hiller, Dean dean.hil...@nrel.gov Unfortunately, the security aspect is very strict. Some make their data public but there are many projects where, due to client contracts, they cannot make their data public within our company (i.e.
Other groups in our company are not allowed to see the data). Also, currently, we have researchers upload their own datasets as well. Ideally, they want to see this one noSQL store as the place where all data for the company lives... ALL of it, so if you add up all the applications which would be huge and then all the tables which is large, it just keeps growing. It is a very nice concept (all data in one location), though we will see how implementing it goes. How much overhead per column family in RAM? So far we have around 4000 CFs with no issue that I see yet. Dean On 9/27/12 11:10 AM, Aaron Turner synfina...@gmail.com wrote: On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean dean.hil...@nrel.gov wrote: We have 1000's of different building devices and we stream data from these devices. [...] How strict are your security requirements? If it wasn't for that, you'd be much better off storing data on a per-statistic basis than per-device. Hell, you could store everything in a single CF by using a composite row key: devicename|stat type|instance But yeah, there isn't a hard limit for the number of CFs, but there is overhead associated with each one, and so I wouldn't consider your design scalable. Generally speaking, hundreds are OK, but thousands is pushing it. -- Aaron Turner http://synfin.net/ Twitter: @synfinatic http://tcpreplay.synfin.net/ - Pcap editing and replay tools
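Aaron's composite-row-key suggestion, sketched in Python (illustrative only; a real deployment would more likely use Cassandra's CompositeType or a byte-safe encoding rather than this naive string join, and the separator is whatever you pick):

```python
SEP = "|"

def composite_row_key(device: str, stat_type: str, instance: str) -> str:
    """Build a 'devicename|stat type|instance' row key as Aaron describes."""
    parts = (device, stat_type, instance)
    for p in parts:
        if SEP in p:
            raise ValueError("key part %r contains the separator %r" % (p, SEP))
    return SEP.join(parts)

def split_row_key(key: str) -> tuple:
    """Recover the three components from a composite row key."""
    device, stat_type, instance = key.split(SEP)
    return device, stat_type, instance
```

The point of the design is that one physical CF can hold every device's every statistic, with the security/grouping dimension pushed into the key instead of into the schema.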
Re: 1000's of column families
I think I misunderstood your all data in one location note. I thought you meant to store it all in one CF. Best regards, Robin Verlangen Software engineer W http://www.robinverlangen.nl E ro...@us2.nl 2012/9/28 Hiller, Dean dean.hil...@nrel.gov I thought someone was saying each column family added to RAM on every node not RAM on a single node. It adds RAM on every node??? So eventually, I will run out? [...]
Re: 1000's of column families
Yeah, you can't scale the number of CFs by adding new nodes to a cluster. You have to create multiple clusters. Anyways, I was thinking about your problem and the solution seems to be to give each team/project their own CF and have them use composite row keys as I wrote about earlier. Yes, that may mean you store the data for the same node multiple times, but that's pretty typical with Cassandra, where you're de-normalizing your data to meet your query needs, and Cassandra does seem to scale that way. Also, if you're not already using compression, you should. My experience with compression and time series data has been pretty amazing, especially with my CFs where I store a day's worth of data in a single column as a vector. That gives you the best of both worlds: you get your per-team security, and I'd assume (ha!) that would dramatically reduce the number of CFs you have to deal with, since it's per-team/project and not per-device. On Fri, Sep 28, 2012 at 12:14 PM, Hiller, Dean dean.hil...@nrel.gov wrote: I thought someone was saying each column family added to RAM on every node not RAM on a single node. [...]
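Aaron's "day's worth of data in a single column as a vector" approach can be sketched like this (Python; the fixed-width little-endian double layout is an assumption for illustration, not what Aaron actually used):

```python
import struct

def pack_day(samples):
    """Pack one day of numeric readings into a single column value."""
    return struct.pack("<%dd" % len(samples), *samples)

def unpack_day(blob):
    """Recover the readings from a packed column value."""
    count = len(blob) // 8  # 8 bytes per IEEE-754 double
    return list(struct.unpack("<%dd" % count, blob))
```

One column per device-day instead of, say, 1440 columns of minute samples eliminates most per-column overhead, and since neighboring readings tend to be similar, the blob compresses well, which is presumably why Aaron sees such good results with compression on time series.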
Re: 1000's of column families
We had some serious trouble with dynamically adding CFs, although last time we tried we were using version 0.7, so maybe that's not an issue any more. Our problems were two: - You are (were?) not supposed to add CFs concurrently. Since we had more servers talking to the same Cassandra cluster, we had to use distributed locks (Hazelcast) to avoid concurrency. - You must be very careful about adding new CFs on different Cassandra nodes. If you do that fast enough, and the clocks of the two servers are skewed, you will severely compromise your schema (Cassandra will not understand in which order the updates must be applied). As I said, this applied to version 0.7; maybe current versions have solved these problems. Flavio On 2012/09/27 16:11, Hiller, Dean wrote: We have 1000's of different building devices and we stream data from these devices. [...]
Re: 1000's of column families
On Thu, Sep 27, 2012 at 12:13 AM, Hiller, Dean dean.hil...@nrel.gov wrote: We are streaming data with 1 stream per 1 CF and we have 1000's of CFs. When using the tools they are all geared to analyzing ONE column family at a time :(. If I remember correctly, Cassandra supports as many CFs as you want, correct? Even though I am going to have tons of fun with limitations on the tools, correct? (I may end up wrapping the node tool with my own aggregate calls if needed to sum up multiple column families and such). Is there a non-rhetorical question in there? Or is that maybe a feature request in disguise? -- Sylvain
Re: 1000's of column families
Every CF adds some overhead (in memory) to each node. This is something you should really keep in mind. Best regards, Robin Verlangen Software engineer W http://www.robinverlangen.nl E ro...@us2.nl 2012/9/27 Sylvain Lebresne sylv...@datastax.com Is there a non rhetorical question in there? Maybe is that a feature request in disguise? [...]
Re: 1000's of column families
Is there a non rhetorical question in there? Maybe is that a feature request in disguise? The question was basically: is Cassandra OK with as many CFs as you want? It sounds like it is not, based on the email saying that every CF causes a bit more RAM to be used. So if Cassandra is not OK with as many CFs as you want, does anyone know what that limit would be for 16G of RAM, or something I could calculate with? Thanks, Dean On 9/27/12 2:37 AM, Sylvain Lebresne sylv...@datastax.com wrote: [...]
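Nobody in the thread gives Dean a hard number, and the real per-CF cost varies by Cassandra version and configuration, but the back-of-envelope calculation he is asking for looks something like this (a hedged sketch; `per_cf_overhead_mb` is a placeholder you would have to measure on your own cluster, not a real figure):

```python
def max_column_families(heap_mb, per_cf_overhead_mb, reserved_fraction=0.5):
    """Rough ceiling on CF count for a given JVM heap size.

    reserved_fraction: heap kept free for memtables, caches, and GC headroom,
    so only the remainder is budgeted for per-CF fixed overhead.
    """
    usable_mb = heap_mb * (1.0 - reserved_fraction)
    return int(usable_mb // per_cf_overhead_mb)
```

For example, with a 16 GB heap, half reserved, and an assumed 1 MB of fixed overhead per CF, the ceiling would be around 8192 CFs, which is at least consistent with Dean's "4000 CFs with no issue yet" observation earlier in the thread.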
Re: 1000's of column families
Out of curiosity, is it really necessary to have that amount of CFs? I am probably still used to relational databases, where you would use a new table just in case you need to store different kinds of data. As Cassandra stores anything in each CF, it might make sense to have a lot of CFs to store your data... But why wouldn't you use a single CF with partitions in this case? Wouldn't it be the same thing? I am asking because I might learn a new modeling technique with the answer. []s 2012/9/26 Hiller, Dean dean.hil...@nrel.gov We are streaming data with 1 stream per 1 CF and we have 1000's of CFs. When using the tools they are all geared to analyzing ONE column family at a time :(. If I remember correctly, Cassandra supports as many CFs as you want, correct? Even though I am going to have tons of fun with limitations on the tools, correct? (I may end up wrapping the node tool with my own aggregate calls if needed to sum up multiple column families and such). Thanks, Dean -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
Re: 1000's of column families
We have 1000's of different building devices and we stream data from these devices. The format and data from each one varies, so one device has temperature at timeX with some other variables, another device has CO2 percentage and other variables. Every device is unique and streams its own data. We dynamically discover devices and register them. Basically, one CF or table per thing really makes sense in this environment. While we could try to find out which devices are similar, this would really be a pain, and some devices add some new variable into the equation. NOT only that, but researchers can register new datasets and upload them as well, and each dataset they have they do NOT want to share with other researchers necessarily, so we have security groups and each CF belongs to security groups. We dynamically create CFs on the fly as people register new datasets. On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions. We could create one CF and put all the data and have a partition per device, but then a time partition will contain multiple devices of data, meaning we need to shrink our time partition size, where if we have a CF per device, the time partition can be larger as it is only for that one device. THEN, on top of that, we have a meta CF for these devices, so some people want to query for streams that match criteria AND which returns a CF name, and they query that CF name, so we almost need a query with variables like select cfName from Meta where x = y and then select * from cfName where x. Which we can do today. Dean From: Marcelo Elias Del Valle mvall...@gmail.com Reply-To: user@cassandra.apache.org Date: Thursday, September 27, 2012 8:01 AM To: user@cassandra.apache.org Subject: Re: 1000's of column families Out of curiosity, is it really necessary to have that amount of CFs? [...]
Re: 1000's of column families
Dean, I was used, in the relational world, to using Hibernate and O/R mapping. There were times when I used 3 classes (2 inheriting from 1 another) and mapped all of them to 1 table. The common part was in the superclass and each subclass had its own columns. The table, however, used to have all the columns, and this design was hard because of that, as creating more subclasses would need changes in the table. However, if you use playOrm, and if playOrm has/had a feature to allow inheritance mapping to a CF, it would solve your problem, wouldn't it? Of course it is probably much harder than it might appear... :D Best regards, Marcelo Valle. 2012/9/27 Hiller, Dean dean.hil...@nrel.gov We have 1000's of different building devices and we stream data from these devices. [...]
Re: 1000's of column families
PlayOrm DOES support inheritance mapping but only supports single table right now. In fact, DboColumnMeta.java has 4 subclasses that all map to that one ColumnFamily, so we already support and heavily use the inheritance feature. That said, I am more concerned with scalability. The more you stuff into a table, the more partitions you need... As an example, I really have a choice. Have this in a partition:
device1 datapoint1
device2 datapoint1
device1 datapoint2
device2 datapoint2
device1 datapoint3
OR have just this in a partition:
device1 datapoint1
device1 datapoint2
device1 datapoint3
If I use the latter approach, I can have more points for device1 in one partition. I could use inheritance, but then I can't fit as many data points for device 1 in a partition. Does that make more sense? Later, Dean From: Marcelo Elias Del Valle mvall...@gmail.com Reply-To: user@cassandra.apache.org Date: Thursday, September 27, 2012 8:45 AM To: user@cassandra.apache.org Subject: Re: 1000's of column families Dean, I was used, in the relational world, to using Hibernate and O/R mapping. [...]
2012/9/27 Hiller, Dean dean.hil...@nrel.gov

We have 1000's of different building devices and we stream data from these devices. The format and data from each one varies, so one device has temperature at timeX with some other variables, another device has CO2 percentage and other variables. Every device is unique and streams its own data. We dynamically discover devices and register them. Basically, one CF or table per thing really makes sense in this environment. While we could try to find out which devices are similar, this would really be a pain, and some devices add some new variable into the equation. NOT only that, but researchers can register new datasets and upload them as well, and each dataset they have they do NOT necessarily want to share with other researchers, so we have security groups and each CF belongs to security groups. We dynamically create CFs on the fly as people register new datasets.

On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions. We could create one CF and put all the data in it and have a partition per device, but then a time partition would contain multiple devices' data, meaning we would need to shrink our time partition size, whereas if we have a CF per device, the time partition can be larger as it is only for that one device.

THEN, on top of that, we have a meta CF for these devices, so some people want to query for streams that match criteria, AND that query returns a CF name, and then they query that CF name. We almost need a query with variables, like "select cfName from Meta where x = y" and then "select * from cfName where x" -- which we can do today.
Dean

From: Marcelo Elias Del Valle mvall...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Thursday, September 27, 2012 8:01 AM
To: user@cassandra.apache.org
Subject: Re: 1000's of column families

Out of curiosity, is it really necessary to have that amount of CFs? I am probably still used to relational databases, where you would use a new table just in case you need to store different kinds of data. As Cassandra stores anything in each CF, it might make sense to have a lot of CFs to store your data... But why wouldn't you use a single CF with partitions in this case? Wouldn't it be the same thing? I am asking because I might learn a new modeling technique with the answer. []s

2012/9/26 Hiller, Dean dean.hil...@nrel.gov
We are streaming data with 1 stream per 1 CF and we have 1000's of CFs.
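The trade-off Dean describes above -- the fewer devices you mix into a partition, the longer the time window each partition can cover -- can be illustrated with a toy calculation. The budget and ingest-rate numbers below are made up purely for illustration:

```python
MAX_POINTS_PER_PARTITION = 1_000_000  # hypothetical per-partition row budget
POINTS_PER_DEVICE_PER_DAY = 10_000    # hypothetical ingest rate per device

def partition_window_days(devices_per_partition):
    """Days of data one partition can hold before hitting the budget."""
    points_per_day = devices_per_partition * POINTS_PER_DEVICE_PER_DAY
    return MAX_POINTS_PER_PARTITION // points_per_day

# Mixing 10 devices into one partition vs. one device per partition:
mixed = partition_window_days(10)   # partition fills in 10 days
single = partition_window_days(1)   # partition lasts 100 days
```

With one device per partition (or per CF), each time partition spans 10x longer here, which is exactly why Dean prefers not to interleave devices.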
Re: 1000's of column families
Hector also offers support for 'Virtual Keyspaces' which you might want to look at.

On Thu, Sep 27, 2012 at 1:10 PM, Aaron Turner synfina...@gmail.com wrote:

On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean dean.hil...@nrel.gov wrote:
We have 1000's of different building devices and we stream data from these devices. The format and data from each one varies, so one device has temperature at timeX with some other variables, another device has CO2 percentage and other variables. Every device is unique and streams its own data. We dynamically discover devices and register them. Basically, one CF or table per thing really makes sense in this environment. While we could try to find out which devices are similar, this would really be a pain, and some devices add some new variable into the equation. NOT only that, but researchers can register new datasets and upload them as well, and each dataset they have they do NOT necessarily want to share with other researchers, so we have security groups and each CF belongs to security groups. We dynamically create CFs on the fly as people register new datasets. On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions. We could create one CF and put all the data in it and have a partition per device, but then a time partition would contain multiple devices' data, meaning we would need to shrink our time partition size, whereas if we have a CF per device, the time partition can be larger as it is only for that one device. THEN, on top of that, we have a meta CF for these devices, so some people want to query for streams that match criteria, AND that query returns a CF name, and then they query that CF name. We almost need a query with variables, like "select cfName from Meta where x = y" and then "select * from cfName where x" -- which we can do today.

How strict are your security requirements? If it wasn't for that, you'd be much better off storing data on a per-statistic basis than per-device.
Hell, you could store everything in a single CF by using a composite row key: devicename|stat type|instance. But yeah, there isn't a hard limit for the number of CFs, but there is overhead associated with each one, so I wouldn't consider your design scalable. Generally speaking, hundreds are OK, but thousands is pushing it.

--
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. -- Benjamin Franklin
"carpe diem quam minimum credula postero"
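Aaron's "devicename|stat type|instance" composite row key is just string concatenation with a reserved separator. A minimal sketch of building and splitting such keys (the separator choice and the guard against it appearing in the parts are my own assumptions, not anything the thread specifies):

```python
SEP = "|"  # assumed separator; must never occur inside a key component

def make_row_key(device, stat_type, instance):
    # Build a composite row key in the devicename|stat type|instance style.
    for part in (device, stat_type, instance):
        if SEP in part:
            raise ValueError(f"{SEP!r} is reserved as the key separator")
    return SEP.join((device, stat_type, instance))

def split_row_key(key):
    # Recover the three components from a composite row key.
    device, stat_type, instance = key.split(SEP)
    return device, stat_type, instance

key = make_row_key("hvac-unit-3", "temperature", "sensor0")
```

Because the client can rebuild the key from names it already knows, it can read the row directly with no index lookup -- the property Aaron leans on later in the thread.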
Re: 1000's of column families
On Thu, Sep 27, 2012 at 7:35 PM, Marcelo Elias Del Valle mvall...@gmail.com wrote:

2012/9/27 Aaron Turner synfina...@gmail.com
How strict are your security requirements? If it wasn't for that, you'd be much better off storing data on a per-statistic basis than per-device. Hell, you could store everything in a single CF by using a composite row key: devicename|stat type|instance. But yeah, there isn't a hard limit for the number of CFs, but there is overhead associated with each one, so I wouldn't consider your design scalable. Generally speaking, hundreds are OK, but thousands is pushing it.

Aaron, imagine that instead of using a composite key in this case, you use a simple row key, instance_uuid. Then, to index data by devicename|stat type|instance, you use another CF with this composite key, or several CFs to index it. Do you see any drawbacks in terms of performance?

Really that depends on the client side, I think. Ideally, you'd like the client to be able to directly access the row by name without looking it up in some index. Basically, if you have to look up the instance_uuid, that's another call to some datastore, which takes more time than generating the row key via its composites. At least that's my opinion...

Of course there are times where using an instance_uuid makes a lot of sense... like if you rename a device and want all your stats to move to the new name. Much easier to just update the mapping record than to read and rewrite all your rows for that device. In my project, we use a device_uuid (just a primary key stored in an Oracle DB... long story!), but everything else is by name in our composite row keys.

--
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
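Aaron's point about the extra round trip can be made concrete with a toy stand-in for the datastore that simply counts reads. The CF names, keys, and values below are all hypothetical; the only thing the sketch shows is that the index route costs two calls where the composite-key route costs one:

```python
calls = []  # records every datastore round trip

store = {
    "index_cf": {"deviceA|temp|s0": "uuid-123"},     # composite -> uuid index
    "data_by_uuid": {"uuid-123": {"value": 71.2}},   # rows keyed by uuid
    "data_by_name": {"deviceA|temp|s0": {"value": 71.2}},  # rows keyed by name
}

def fetch(cf, key):
    # Stand-in for one read against the datastore.
    calls.append((cf, key))
    return store[cf][key]

# Via an index CF: two round trips (index lookup, then the data read).
uuid = fetch("index_cf", "deviceA|temp|s0")
row_indexed = fetch("data_by_uuid", uuid)
indexed_calls = len(calls)

# Via a composite row key the client builds itself: one round trip.
calls.clear()
row_direct = fetch("data_by_name", "deviceA|temp|s0")
```

As Aaron notes, the uuid indirection still earns its keep when devices get renamed: you update one mapping row instead of rewriting every data row.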
Re: 1000's of column families
Unfortunately, the security aspect is very strict. Some make their data public, but there are many projects where, due to client contracts, they cannot make their data public within our company (i.e. other groups in our company are not allowed to see the data). Also, currently, we have researchers upload their own datasets as well. Ideally, they want to see this one noSQL store as the place where all data for the company lives... ALL of it, so if you add up all the applications, which would be huge, and then all the tables, which is large, it just keeps growing. It is a very nice concept (all data in one location), though we will see how implementing it goes.

How much overhead per column family in RAM? So far we have around 4000 CFs with no issue that I see yet.

Dean

On 9/27/12 11:10 AM, Aaron Turner synfina...@gmail.com wrote:

On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean dean.hil...@nrel.gov wrote:
We have 1000's of different building devices and we stream data from these devices. The format and data from each one varies, so one device has temperature at timeX with some other variables, another device has CO2 percentage and other variables. Every device is unique and streams its own data. We dynamically discover devices and register them. Basically, one CF or table per thing really makes sense in this environment. While we could try to find out which devices are similar, this would really be a pain, and some devices add some new variable into the equation. NOT only that, but researchers can register new datasets and upload them as well, and each dataset they have they do NOT necessarily want to share with other researchers, so we have security groups and each CF belongs to security groups. We dynamically create CFs on the fly as people register new datasets. On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions.
We could create one CF and put all the data in it and have a partition per device, but then a time partition would contain multiple devices' data, meaning we would need to shrink our time partition size, whereas if we have a CF per device, the time partition can be larger as it is only for that one device. THEN, on top of that, we have a meta CF for these devices, so some people want to query for streams that match criteria, AND that query returns a CF name, and then they query that CF name. We almost need a query with variables, like "select cfName from Meta where x = y" and then "select * from cfName where x" -- which we can do today.

How strict are your security requirements? If it wasn't for that, you'd be much better off storing data on a per-statistic basis than per-device. Hell, you could store everything in a single CF by using a composite row key: devicename|stat type|instance. But yeah, there isn't a hard limit for the number of CFs, but there is overhead associated with each one, so I wouldn't consider your design scalable. Generally speaking, hundreds are OK, but thousands is pushing it.
Re: 1000's of column families
so if you add up all the applications which would be huge and then all the tables which is large, it just keeps growing. It is a very nice concept (all data in one location), though we will see how implementing it goes.

This shouldn't be a real problem for Cassandra. Just add more nodes, and every node contains a smaller piece of the cake (~ring).

Best regards,
Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E ro...@us2.nl
http://goo.gl/Lt7BC

2012/9/27 Hiller, Dean dean.hil...@nrel.gov
Unfortunately, the security aspect is very strict. Some make their data public, but there are many projects where, due to client contracts, they cannot make their data public within our company (i.e. other groups in our company are not allowed to see the data). Also, currently, we have researchers upload their own datasets as well. Ideally, they want to see this one noSQL store as the place where all data for the company lives... ALL of it, so if you add up all the applications, which would be huge, and then all the tables, which is large, it just keeps growing. It is a very nice concept (all data in one location), though we will see how implementing it goes.

How much overhead per column family in RAM? So far we have around 4000 CFs with no issue that I see yet.

Dean

On 9/27/12 11:10 AM, Aaron Turner synfina...@gmail.com wrote:
On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean dean.hil...@nrel.gov wrote:
We have 1000's of different building devices and we stream data from these devices.
The format and data from each one varies, so one device has temperature at timeX with some other variables, another device has CO2 percentage and other variables. Every device is unique and streams its own data. We dynamically discover devices and register them. Basically, one CF or table per thing really makes sense in this environment. While we could try to find out which devices are similar, this would really be a pain, and some devices add some new variable into the equation. NOT only that, but researchers can register new datasets and upload them as well, and each dataset they have they do NOT necessarily want to share with other researchers, so we have security groups and each CF belongs to security groups. We dynamically create CFs on the fly as people register new datasets. On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions. We could create one CF and put all the data in it and have a partition per device, but then a time partition would contain multiple devices' data, meaning we would need to shrink our time partition size, whereas if we have a CF per device, the time partition can be larger as it is only for that one device. THEN, on top of that, we have a meta CF for these devices, so some people want to query for streams that match criteria, AND that query returns a CF name, and then they query that CF name. We almost need a query with variables, like "select cfName from Meta where x = y" and then "select * from cfName where x" -- which we can do today.

How strict are your security requirements? If it wasn't for that, you'd be much better off storing data on a per-statistic basis than per-device. Hell, you could store everything in a single CF by using a composite row key: devicename|stat type|instance. But yeah, there isn't a hard limit for the number of CFs, but there is overhead associated with each one, so I wouldn't consider your design scalable. Generally speaking, hundreds are OK, but thousands is pushing it.
1000's of column families
We are streaming data with 1 stream per 1 CF and we have 1000's of CFs. When using the tools, they are all geared to analyzing ONE column family at a time :(. If I remember correctly, Cassandra supports as many CFs as you want, correct? Even though I am going to have tons of fun with limitations on the tools, correct? (I may end up wrapping the node tool with my own aggregate calls if needed to sum up multiple column families and such.) Thanks, Dean
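The "wrap the node tool with my own aggregate calls" idea boils down to summing one metric across many per-CF stat blocks. A minimal sketch, assuming you have already parsed per-CF output (the CF names, metric names, and numbers below are made up; real `nodetool cfstats` output would need its own parsing step):

```python
# Hypothetical parsed per-CF stats, one dict per column family.
cf_stats = {
    "device_cf_1401": {"live_sstables": 4, "space_used_bytes": 1_200_000},
    "device_cf_1402": {"live_sstables": 2, "space_used_bytes": 800_000},
    "device_cf_1403": {"live_sstables": 3, "space_used_bytes": 500_000},
}

def aggregate(stats, metric):
    """Sum one metric across all column families."""
    return sum(per_cf[metric] for per_cf in stats.values())

total_space = aggregate(cf_stats, "space_used_bytes")
total_sstables = aggregate(cf_stats, "live_sstables")
```

With thousands of CFs, a wrapper like this is the only practical way to get a whole-cluster view out of tools that report one CF at a time.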