Re: Once again, super columns or composites?

2012-09-27 Thread Sylvain Lebresne
When people suggest composites instead of super columns, they mean
composite column 'names', not composite column 'values'. None of the
advantages you cite stand in the case of composite column 'names'.

--
Sylvain

On Wed, Sep 26, 2012 at 11:52 PM, Edward Kibardin infa...@gmail.com wrote:
 Hi Community,

 I know, I know... everyone is claiming Super Columns are not good enough
 and it's dangerous to use them now.
 But from my perspective, they have several very good advantages:

 - You are not fixed to a schema and can always add one more column to a
 subset of your supercolumns.
 - A SuperColumn is loaded as a whole if you request at least one sub-column,
 but that's the same as loading a whole composite value to get only one
 sub-value.
 - In supercolumns you can update a single subcolumn without touching the
 other subcolumns; with composites you're unable to update just a portion of
 a composite value.

 Maybe I do not understand composites correctly, but with very small
 supercolumns (10-15 subcolumns) I still think SuperColumns might be the best
 solution for me...
 In addition, building supercolumns with SSTableWriter is pretty
 straightforward for me, while that's not the case with composites...

 Any arguments?




Re: 1000's of column families

2012-09-27 Thread Sylvain Lebresne
On Thu, Sep 27, 2012 at 12:13 AM, Hiller, Dean dean.hil...@nrel.gov wrote:
 We are streaming data with 1 stream per 1 CF and we have 1000's of CF.  When 
 using the tools they are all geared to analyzing ONE column family at a time 
 :(.  If I remember correctly, Cassandra supports as many CF's as you want, 
 correct?  Even though I am going to have tons of funs with limitations on the 
 tools, correct?

 (I may end up wrapping the node tool with my own aggregate calls if needed to 
 sum up multiple column families and such).

Is there a non rhetorical question in there? Maybe is that a feature
request in disguise?

--
Sylvain


Re: compression

2012-09-27 Thread Tamar Fraenkel
Hi!
First, the problem is still there, although I checked and all nodes agree on
the schema.
This is from ls -l
Good Node
-rw-r--r-- 1 cassandra cassandra     606 2012-09-27 08:01 tk_usus_user-hc-269-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 2246431 2012-09-27 08:01 tk_usus_user-hc-269-Data.db
-rw-r--r-- 1 cassandra cassandra   11056 2012-09-27 08:01 tk_usus_user-hc-269-Filter.db
-rw-r--r-- 1 cassandra cassandra  129792 2012-09-27 08:01 tk_usus_user-hc-269-Index.db
-rw-r--r-- 1 cassandra cassandra    4336 2012-09-27 08:01 tk_usus_user-hc-269-Statistics.db

Node 2
-rw-r--r-- 1 cassandra cassandra 4592393 2012-09-27 08:01 tk_usus_user-hc-268-Data.db
-rw-r--r-- 1 cassandra cassandra      69 2012-09-27 08:01 tk_usus_user-hc-268-Digest.sha1
-rw-r--r-- 1 cassandra cassandra   11056 2012-09-27 08:01 tk_usus_user-hc-268-Filter.db
-rw-r--r-- 1 cassandra cassandra  129792 2012-09-27 08:01 tk_usus_user-hc-268-Index.db
-rw-r--r-- 1 cassandra cassandra    4336 2012-09-27 08:01 tk_usus_user-hc-268-Statistics.db

Node 3
-rw-r--r-- 1 cassandra cassandra 4592393 2012-09-27 08:01 tk_usus_user-hc-278-Data.db
-rw-r--r-- 1 cassandra cassandra      69 2012-09-27 08:01 tk_usus_user-hc-278-Digest.sha1
-rw-r--r-- 1 cassandra cassandra   11056 2012-09-27 08:01 tk_usus_user-hc-278-Filter.db
-rw-r--r-- 1 cassandra cassandra  129792 2012-09-27 08:01 tk_usus_user-hc-278-Index.db
-rw-r--r-- 1 cassandra cassandra    4336 2012-09-27 08:01 tk_usus_user-hc-278-Statistics.db

Looking at the logs, on the good node I can see

 INFO [MigrationStage:1] 2012-09-24 10:08:16,511 Migration.java (line 119)
Applying migration c22413b0-062f-11e2--1bcb936807db Update column
family to org.apache.cassandra.config.CFMetaData@1dbdcde9
[cfId=1016,ksName=tok,cfName=tk_usus_user,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type),subcolumncomparator=null,comment=,rowCacheSize=0.0,keyCacheSize=20.0,readRepairChance=1.0,replicateOnWrite=true,gcGraceSeconds=864000,defaultValidator=org.apache.cassandra.db.marshal.UTF8Type,keyValidator=org.apache.cassandra.db.marshal.UUIDType,minCompactionThreshold=4,maxCompactionThreshold=32,rowCacheSavePeriodInSeconds=0,keyCacheSavePeriodInSeconds=14400,rowCacheKeysToSave=2147483647,rowCacheProvider=org.apache.cassandra.cache.SerializingCacheProvider@3505231c,mergeShardsChance=0.1,keyAlias=java.nio.HeapByteBuffer[pos=485
lim=488 cap=653],column_metadata={},compactionStrategyClass=class
org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy,compactionStrategyOptions={},compressionOptions={sstable_compression=org.apache.cassandra.io.compress.SnappyCompressor,
chunk_length_kb=64},bloomFilterFpChance=null]

But the same can be seen in the logs of the two other nodes:
 INFO [MigrationStage:1] 2012-09-24 10:08:16,767 Migration.java (line 119)
Applying migration c22413b0-062f-11e2--1bcb936807db Update column
family to org.apache.cassandra.config.CFMetaData@24fbb95d
[cfId=1016,ksName=tok,cfName=tk_usus_user,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type),subcolumncomparator=null,comment=,rowCacheSize=0.0,keyCacheSize=20.0,readRepairChance=1.0,replicateOnWrite=true,gcGraceSeconds=864000,defaultValidator=org.apache.cassandra.db.marshal.UTF8Type,keyValidator=org.apache.cassandra.db.marshal.UUIDType,minCompactionThreshold=4,maxCompactionThreshold=32,rowCacheSavePeriodInSeconds=0,keyCacheSavePeriodInSeconds=14400,rowCacheKeysToSave=2147483647,rowCacheProvider=org.apache.cassandra.cache.SerializingCacheProvider@a469ba3,mergeShardsChance=0.1,keyAlias=java.nio.HeapByteBuffer[pos=0
lim=3 cap=3],column_metadata={},compactionStrategyClass=class
org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy,compactionStrategyOptions={},compressionOptions={sstable_compression=org.apache.cassandra.io.compress.SnappyCompressor,
chunk_length_kb=64},bloomFilterFpChance=null]

 INFO [MigrationStage:1] 2012-09-24 10:08:16,705 Migration.java (line 119)
Applying migration c22413b0-062f-11e2--1bcb936807db Update column
family to org.apache.cassandra.config.CFMetaData@216b6a58

Re: 1.1.5 Missing Insert! Strange Problem

2012-09-27 Thread Sylvain Lebresne
 I can verify the existence of the key that was inserted in the commit logs of both
 replicas, however it seems that this record was never inserted.

Out of curiosity, how can you verify that?

--
Sylvain


RE: Data Modeling: Comments with Voting

2012-09-27 Thread Roshni Rajagopal

Hi Drew,
I think you have 4 requirements. Here are my suggestions.
a) Store comments: have a static column family for comments with master data
like created date, created by, length, etc.
b) When a person votes for a comment, increment a vote counter: have a counter
column family for incrementing the votes for each comment.
c) Display comments sorted by date created: have a column family with a dummy
row id 'sort_by_time_list'; column names can be the date created (TimeUUID),
and the column value can be the comment id.
d) Display comments sorted by number of votes: have a column family with a dummy
row id 'sort_by_votes_list', and column names can be a composite of number of
votes and comment id (as more than 1 comment can have the same votes).

Regards,
Roshni
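A minimal pycassa sketch of the layout suggested above; the keyspace, CF and
column names here are illustrative assumptions (a static 'comments' CF, a
'comment_votes' counter CF, a TimeUUID-compared 'comments_by_time' CF and a
'comments_by_votes' CF whose comparator is CompositeType(LongType, UTF8Type)),
and re-sorting by votes is simply a delete plus insert of one composite name:

import time
import pycassa
from pycassa.util import convert_time_to_uuid

pool = pycassa.ConnectionPool('my_keyspace')                 # assumed keyspace
comments = pycassa.ColumnFamily(pool, 'comments')            # static CF, one row per comment
votes = pycassa.ColumnFamily(pool, 'comment_votes')          # counter CF
by_time = pycassa.ColumnFamily(pool, 'comments_by_time')     # one wide row, TimeUUID -> comment id
by_votes = pycassa.ColumnFamily(pool, 'comments_by_votes')   # one wide row, (votes, comment id) names

def add_comment(comment_id, author, text):
    comments.insert(comment_id, {'author': author, 'text': text})
    by_time.insert('sort_by_time_list',
                   {convert_time_to_uuid(time.time()): comment_id})
    # new comments start with 0 votes
    by_votes.insert('sort_by_votes_list', {(0, comment_id): ''})

def vote(comment_id, old_votes):
    # old_votes is supplied by the caller here just to keep the sketch short;
    # bump the counter, then move the comment to its new slot in the sorted row
    votes.add(comment_id, 'votes', 1)
    by_votes.remove('sort_by_votes_list', columns=[(old_votes, comment_id)])
    by_votes.insert('sort_by_votes_list', {(old_votes + 1, comment_id): ''})

def top_comments(n=10):
    # composite names sort by their first component, so reversing the slice
    # returns the highest vote counts first
    row = by_votes.get('sort_by_votes_list', column_count=n, column_reversed=True)
    return [comment_id for (_, comment_id) in row.keys()]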

 Date: Wed, 26 Sep 2012 17:36:13 -0700
 From: k...@mustardgrain.com
 To: user@cassandra.apache.org
 CC: d...@venarc.com
 Subject: Re: Data Modeling: Comments with Voting
 
 Depending on your needs, you could simply duplicate the comments in two 
 separate CFs with the column names including time in one and the vote in 
 the other. If you allow for updates to the comments, that would pose 
 some issues you'd need to solve at the app level.
 
 On 9/26/12 4:28 PM, Drew Kutcharian wrote:
  Hi Guys,
 
  Wondering what would be the best way to model a flat (no sub comments, i.e. 
  twitter) comments list with support for voting (where I can sort by create 
  time or votes) in Cassandra?
 
  To demonstrate:
 
  Sorted by create time:
  - comment 1 (5 votes)
  - comment 2 (1 votes)
  - comment 3 (no votes)
  - comment 4 (10 votes)
 
  Sorted by votes:
  - comment 4 (10 votes)
  - comment 1 (5 votes)
  - comment 2 (1 votes)
  - comment 3 (no votes)
 
  It's the sorted-by-votes that I'm having a bit of a trouble with. I'm 
  looking for a roll-your-own approach and prefer not to use secondary 
  indexes and CQL sorting.
 
  Thanks,
 
  Drew
 
 
  

Re: 1000's of column families

2012-09-27 Thread Robin Verlangen
Every CF adds some overhead (in memory) to each node. This is something you
should really keep in mind.

Best regards,

Robin Verlangen
*Software engineer*
*
*
W http://www.robinverlangen.nl
E ro...@us2.nl

http://goo.gl/Lt7BC

Disclaimer: The information contained in this message and attachments is
intended solely for the attention and use of the named addressee and may be
confidential. If you are not the intended recipient, you are reminded that
the information remains the property of the sender. You must not use,
disclose, distribute, copy, print or rely on this e-mail. If you have
received this message in error, please contact the sender immediately and
irrevocably delete this message and any copies.



2012/9/27 Sylvain Lebresne sylv...@datastax.com

 On Thu, Sep 27, 2012 at 12:13 AM, Hiller, Dean dean.hil...@nrel.gov
 wrote:
  We are streaming data with 1 stream per 1 CF and we have 1000's of CF.
  When using the tools they are all geared to analyzing ONE column family at
 a time :(.  If I remember correctly, Cassandra supports as many CF's as you
 want, correct?  Even though I am going to have tons of funs with
 limitations on the tools, correct?
 
  (I may end up wrapping the node tool with my own aggregate calls if
 needed to sum up multiple column families and such).

 Is there a non rhetorical question in there? Maybe is that a feature
 request in disguise?

 --
 Sylvain



Re: is node tool row count always way off?

2012-09-27 Thread Sylvain Lebresne
 The node tool cfstats, what is the row count estimate usually off by (what
 percentage? Or what absolute number?)

It will likely not be very good but is supposed to give an order of
magnitude. That being said, there are at least the following sources of
inaccuracy:
 - It counts deleted rows that have not been gc'ed (i.e. that have been
deleted less than gc_grace ago).
 - It estimates the number of rows for each sstable and sums that. If
you have a relatively low number of rows that you overwrite a lot, it
means the estimate can be nb_sstables times more than reality.
 - For an sstable, it estimates the number of rows by using the index
summary in memory. We know that this summary keeps one entry out of every 128
rows, so we take the summary size and multiply by 128. But this means
we have a +/- 128 error range when doing that. In practice, when you
don't have a trivial number of rows, which is the common case, this is a
pretty good estimate. However, in your case you probably have 1 or 2
rows per sstable, which each yield one index summary entry and are thus
counted as 128 (and 128 * 3 = 384, so the math adds up).

So don't rely too much on that number, but excluding the case of a very
low number of rows, and provided you're not too far behind on
compaction, it does give an order of magnitude.
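As a rough illustration of the estimate described above (the index interval is
assumed at its default of 128, and the sstable counts below are made up, not
taken from this thread):

INDEX_INTERVAL = 128  # assumed default index_interval

def estimated_keys(summary_entries_per_sstable):
    # Each sstable contributes (number of index summary entries) * 128
    # to the reported row count estimate.
    return sum(entries * INDEX_INTERVAL for entries in summary_entries_per_sstable)

# Three tiny sstables holding only a row or two each still yield one summary
# entry apiece, so the estimate comes out as 3 * 128 = 384.
print(estimated_keys([1, 1, 1]))  # 384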

 An SSTable of 3 sounds very weird….

It's not. SizeTieredCompaction never compacts fewer than
min_compaction_threshold sstables, and that min_compaction_threshold
is 4 by default. The threshold is configurable, but 4 is probably a
good value for any CF that has a non-trivial write load.

--
Sylvain


cassandra key cache question

2012-09-27 Thread Tamar Fraenkel
Hi!
Is it possible that in JMX and cfstats the Key cache size is much bigger
than the number of keys in the CF?
Thanks,

*Tamar Fraenkel *
Senior Software Engineer, TOK Media


ta...@tok-media.com
Tel:   +972 2 6409736
Mob:  +972 54 8356490
Fax:   +972 2 5612956

Re: downgrade from 1.1.4 to 1.0.X

2012-09-27 Thread Віталій Тимчишин
I suppose the way is to convert all SSTables to JSON, then install the previous
version, convert back and load.

2012/9/24 Arend-Jan Wijtzes ajwyt...@wise-guys.nl

 On Thu, Sep 20, 2012 at 10:13:49AM +1200, aaron morton wrote:
  No.
  They use different minor file versions which are not backwards
 compatible.

 Thanks Aaron.

 Is upgradesstables capable of downgrading the files to 1.0.8?
 Looking for a way to make this work.

 Regards,
 Arend-Jan


  On 18/09/2012, at 11:18 PM, Arend-Jan Wijtzes ajwyt...@wise-guys.nl
 wrote:
 
   Hi,
  
   We are running Cassandra 1.1.4 and like to experiment with
   Datastax Enterprise which uses 1.0.8. Can we safely downgrade
   a production cluster or is it incompatible? Any special steps
   involved?

 --
 Arend-Jan Wijtzes -- Wiseguys -- www.wise-guys.nl




-- 
Best regards,
 Vitalii Tymchyshyn


Re: cassandra key cache question

2012-09-27 Thread Tamar Fraenkel
Hi!
One more question:
I have a couple of dropped column families, and in the JMX console I don't
see them under org.apache.cassandra.db.ColumnFamilies, *BUT* I do see them
under org.apache.cassandra.db.Caches, and the cache is not empty!
Does it mean that Cassandra still keeps memory busy doing caching for a
non-existent column family? If so, how do I remove those caches?

Thanks!

*Tamar Fraenkel *
Senior Software Engineer, TOK Media


ta...@tok-media.com
Tel:   +972 2 6409736
Mob:  +972 54 8356490
Fax:   +972 2 5612956





On Thu, Sep 27, 2012 at 11:45 AM, Tamar Fraenkel ta...@tok-media.com wrote:

 Hi!
 Is it possible that in JMX and cfstats the Key cache size is much bigger
 than the number of keys in the CF?
 Thanks,

 *Tamar Fraenkel *
 Senior Software Engineer, TOK Media


 ta...@tok-media.com
 Tel:   +972 2 6409736
 Mob:  +972 54 8356490
 Fax:   +972 2 5612956





Re: Once again, super columns or composites?

2012-09-27 Thread Hiller, Dean
Can you describe your use-case in detail as it might be easier to explain a 
model with composite names.
Later,
Dean

From: Edward Kibardin infa...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Thursday, September 27, 2012 4:02 AM
To: user@cassandra.apache.org
Subject: Re: Once again, super columns or composites?

Sylvain, thanks for the response!

I have a use case which involves updating 1.5 million values a day.
Currently I'm just creating a new SSTable using SSTableWriter and uploading
these SuperColumns to Cassandra.
But from my understanding, you just can't update a composite column, only delete
and insert... so this may make my update use case much more complicated.
It's also not possible to add any sub-column to your composite, which means we
fall again into the delete-and-insert case.
... and as far as I know, DynamicComposites are not recommended (and actually not
supported by Pycassa).

Am I correct?

Ed


On Thu, Sep 27, 2012 at 9:28 AM, Sylvain Lebresne sylv...@datastax.com wrote:
When people suggest composites instead of super columns, they mean
composite column 'names', not composite column 'values'. None of the
advantages you cite stand in the case of composite column 'names'.

--
Sylvain

On Wed, Sep 26, 2012 at 11:52 PM, Edward Kibardin infa...@gmail.com wrote:
 Hi Community,

 I know, I know... every one is claiming Super Columns are not good enough
 and it dangerous to use them now.
 But from my perspective, they have several very good advantages like:

 You are not fixed schema and always can add one more columns to subset of
 your supercolumns
 SuperColumn is loaded as whole if you requesting for at least one sub
 column, but it's the same as loading a whole composite value to get only one
 sub-value
 In supercolumns you can update only one subcolumn without touching other
 subcolumns, in case of composites you're unable to update just a portion of
 composite value.

 May be I do not understand composites correctly, but having very small
 supercolumns (10-15 subcolumns) I still think SuperColumns might be the best
 solution for me...
 In addition, building supercolumns with SSTableWriter is pretty much
 strait-forward for me, while it's not the case with composites...

 Any arguments?





Re: 1000's of column families

2012-09-27 Thread Hiller, Dean
Is there a non rhetorical question in there? Maybe is that a feature
request in disguise?


The question was basically: is Cassandra OK with as many CFs as you want?
It sounds like it is not, based on the email saying that every CF causes a bit
more RAM to be used.  So if Cassandra is not OK with as many CFs as you
want, does anyone know what that limit would be for 16G of RAM, or
something I could calculate with?

Thanks,
Dean

On 9/27/12 2:37 AM, Sylvain Lebresne sylv...@datastax.com wrote:

On Thu, Sep 27, 2012 at 12:13 AM, Hiller, Dean dean.hil...@nrel.gov
wrote:
 We are streaming data with 1 stream per 1 CF and we have 1000's of CF.
When using the tools they are all geared to analyzing ONE column family
at a time :(.  If I remember correctly, Cassandra supports as many CF's
as you want, correct?  Even though I am going to have tons of funs with
limitations on the tools, correct?

 (I may end up wrapping the node tool with my own aggregate calls if
needed to sum up multiple column families and such).

Is there a non rhetorical question in there? Maybe is that a feature
request in disguise?

--
Sylvain



Re: Once again, super columns or composites?

2012-09-27 Thread Sylvain Lebresne
 But from my understanding, you just can't update composite column, only
 delete and insert... so this may make my update use case much more
 complicated.

Let me try to sum things up.
In regular column families, a column (value) is defined by 2 keys: the
row key and the column name.
In super column families, a column (value) is defined by 3 keys: the
row key, the super column name and the column name.

So a super column is really just the set of columns that share the
same (row key, super column name) pair.

The idea of composite columns is to use regular columns, but to
distinguish multiple parts of the column name. So take the example of
a CompositeType with 2 components: in that column family, a column
(value) is defined by 3 keys: the row key, the first component of the
column name and the second component of the column name.

In other words, composites are a *generalization* of super columns, and
super columns are the case of composites with 2 components. Except
that super columns are hard-wired into the Cassandra code base in a way
that comes with a number of limitations, the main one being that we
always deserialize a super column (again, which is just a set of
columns) in its entirety when we read it from disk.

So no, it's not true that "you just can't update a composite column,
only delete and insert", nor that it is not possible to add any
sub-column to your composite.

That being said, if you are using the thrift interface, super columns
do have a few perks currently:
  - the grouping of all the sub-columns composing a super column is
hard-wired in Cassandra. The equivalent for composites, which consists
of grouping all columns having the same value for a given component,
must be done client side. Maybe some client libraries do that for you,
but I'm not sure (I don't know about Pycassa for instance).
  - there are a few queries that can be done easily with super columns
but don't translate easily to composites, namely deleting whole super
columns and, to a lesser extent, querying multiple super columns by name.
That's due to a few limitations that upcoming versions of Cassandra
will solve, but it's not the case with currently released versions.

The bottom line is: if you can do without those few perks, then you'd
better use composites since they have fewer limitations. If you can't
really do without those perks and can live with the super column
limitations, then go on, use super columns. (And if you want the perks
without the limitations, wait for Cassandra 1.2 and use CQL3 :D)


 ... and as far as I know, DynamicComposites are not recommended (and actually not
 supported by Pycassa).

DynamicComposites don't do what you think they do. They do nothing
more than regular composites as far as comparing them to SuperColumns
is concerned, except giving you ways to shoot yourself in the foot.

--
Sylvain
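To make the composite-name layout concrete, here is a minimal pycassa sketch;
the keyspace, CF name and component values are illustrative assumptions, and
the CF is assumed to use a CompositeType(UTF8Type, UTF8Type) comparator:

import pycassa

pool = pycassa.ConnectionPool('my_keyspace')        # assumed keyspace
cf = pycassa.ColumnFamily(pool, 'my_composite_cf')  # assumed composite-comparator CF

# The equivalent of writing sub-columns 'temp' and 'unit' under super column 'sensor1':
cf.insert('row1', {('sensor1', 'temp'): '21.5', ('sensor1', 'unit'): 'C'})

# Updating a single "sub-column" is just another insert on that one composite name;
# the other ('sensor1', ...) columns are untouched.
cf.insert('row1', {('sensor1', 'temp'): '22.0'})

# Reading the row back gives an OrderedDict keyed by the composite tuples; the
# client-side grouping Sylvain mentions is then a matter of filtering on the
# first component.
row = cf.get('row1')
sensor1 = {sub: val for (name, sub), val in row.items() if name == 'sensor1'}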


Re: 1000's of column families

2012-09-27 Thread Marcelo Elias Del Valle
Out of curiosity, is it really necessary to have that many CFs?
I am probably still used to relational databases, where you would use a new
table just in case you need to store different kinds of data. As Cassandra
can store anything in each CF, it might make sense to have a lot of
CFs to store your data...
But why wouldn't you use a single CF with partitions in this case?
Wouldn't it be the same thing? I am asking because I might learn a new
modeling technique from the answer.

[]s

2012/9/26 Hiller, Dean dean.hil...@nrel.gov

 We are streaming data with 1 stream per 1 CF and we have 1000's of CF.
  When using the tools they are all geared to analyzing ONE column family at
 a time :(.  If I remember correctly, Cassandra supports as many CF's as you
 want, correct?  Even though I am going to have tons of funs with
 limitations on the tools, correct?

 (I may end up wrapping the node tool with my own aggregate calls if needed
 to sum up multiple column families and such).

 Thanks,
 Dean




-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re:

2012-09-27 Thread Vivek Mishra
1 question.
user_cook_id, user_facebook_id, user_cell_phone, user_personal_id:
will the combination key of all of them be unique? Or is each of them unique
individually?

If the combination is unique, then having an extra column (index enabled)
per row should work for you.

-Vivek


On Thu, Sep 27, 2012 at 7:22 PM, Andre Tavares andre...@gmail.com wrote:


 Hi community,

 I have a question: I need to do a search on a CF that has over 200 million
 rows to find a User key.

 To find the user, I have 4 keys (actually I have 4 keys today, but that can
 increase) that are: user_cook_id, user_facebook_id, user_cell_phone,
 user_personal_id

 If I don't find the User by the given key, I need to perform another query
 passing the other existing keys to find the user.

 My doubt: what is the better design for my CF to find the user over the 4
 keys? I thought to create a CF with a secondary index like this:

 create column family users_test with comparator=UTF8Type and
 column_metadata=[
 {column_name: user_cook_id, validation_class: UTF8Type, index_type: KEYS},
 {column_name: user_facebook_id, validation_class: UTF8Type, index_type:
 KEYS},
 {column_name: user_cell_phone, validation_class: UTF8Type, index_type:
 KEYS},
 {column_name: user_personal_id, validation_class: UTF8Type, index_type:
 KEYS},
 {column_name: user_key, validation_class: UTF8Type, index_type: KEYS}
 ];

 Another approaching is creating just one column for the User CF having
 generic KEY

 create column family users_test with comparator=UTF8Type and
 column_metadata=[
 {column_name: generic_key, validation_class: UTF8Type, index_type: KEYS},
 {column_name: user_key, validation_class: UTF8Type, index_type: KEYS}
 ];

 where generic_key can be a user_cook_id value, or a user_facebook_id,
 user_cell_phone, or user_personal_id value ... the problem with this solution
 is that I have 200 million user ids x 4 keys (user_cook_id,
 user_facebook_id, user_cell_phone, user_personal_id) = 800 million rows

 I ask my friends whether I am on the right track; suggestions are welcome
 .. thanks
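One way to sketch the per-identifier lookup alternative discussed in this thread
(the CF name, the 'type:value' row-key format and the helper names below are
assumptions for illustration, not something proposed verbatim above): each of
the four identifiers gets its own lookup row pointing at the user key, so
finding a user is at most four direct row reads instead of four
secondary-index queries.

from pycassa import ColumnFamily, ConnectionPool, NotFoundException

pool = ConnectionPool('my_keyspace')               # assumed keyspace
user_lookup = ColumnFamily(pool, 'user_lookup')    # assumed lookup CF

def register_user(user_key, ids):
    # ids: {'user_cook_id': ..., 'user_facebook_id': ..., 'user_cell_phone': ...,
    #       'user_personal_id': ...}; values may be None if unknown
    for id_type, id_value in ids.items():
        if id_value is not None:
            # prefix the value with its type so e.g. a phone number can never
            # collide with a cookie id
            user_lookup.insert('%s:%s' % (id_type, id_value), {'user_key': user_key})

def find_user(ids):
    # try each known identifier in turn, as described in the original question
    for id_type, id_value in ids.items():
        if id_value is None:
            continue
        try:
            return user_lookup.get('%s:%s' % (id_type, id_value))['user_key']
        except NotFoundException:
            continue
    return None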



Re: 1000's of column families

2012-09-27 Thread Hiller, Dean
We have 1000's of different building devices and we stream data from these
devices.  The format and data from each one varies, so one device has
temperature at timeX with some other variables, another device has CO2
percentage and other variables.  Every device is unique and streams its own
data.  We dynamically discover devices and register them.  Basically, one CF or
table per thing really makes sense in this environment.  While we could try to
find out which devices are similar, this would really be a pain and some
devices add some new variable into the equation.  NOT only that, but researchers
can register new datasets and upload them as well, and each dataset they have
they do NOT want to share with other researchers necessarily, so we have security
groups and each CF belongs to security groups.  We dynamically create CF's on
the fly as people register new datasets.

On top of that, when the data sets get too large, we probably want to partition 
a single CF into time partitions.  We could create one CF and put all the data 
and have a partition per device, but then a time partition will contain 
multiple devices of data meaning we need to shrink our time partition size 
where if we have CF per device, the time partition can be larger as it is only 
for that one device.

THEN, on top of that, we have a meta CF for these devices so some people want 
to query for streams that match criteria AND which returns a CF name and they 
query that CF name so we almost need a query with variables like select cfName 
from Meta where x = y and then select * from cfName where x. Which we can 
do today.

Dean

From: Marcelo Elias Del Valle mvall...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Thursday, September 27, 2012 8:01 AM
To: user@cassandra.apache.org
Subject: Re: 1000's of column families

Out of curiosity, is it really necessary to have that amount of CFs?
I am probably still used to relational databases, where you would use a new 
table just in case you need to store different kinds of data. As Cassandra 
stores anything in each CF, it might probably make sense to have a lot of CFs 
to store your data...
But why wouldn't you use a single CF with partitions in these case? Wouldn't it 
be the same thing? I am asking because I might learn a new modeling technique 
with the answer.

[]s

2012/9/26 Hiller, Dean dean.hil...@nrel.gov
We are streaming data with 1 stream per 1 CF and we have 1000's of CF.  When 
using the tools they are all geared to analyzing ONE column family at a time 
:(.  If I remember correctly, Cassandra supports as many CF's as you want, 
correct?  Even though I am going to have tons of funs with limitations on the 
tools, correct?

(I may end up wrapping the node tool with my own aggregate calls if needed to 
sum up multiple column families and such).

Thanks,
Dean



--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re:

2012-09-27 Thread Andre Tavares
user_cook_id, user_facebook_id, user_cell_phone, user_personal_id :
Combination key of all will be unique?  Or all of them are unique
individually.?

Combination key of all will be unique?

no ...


Or all of them are unique individually?
 yes ... all of them are unique individually


2012/9/27 Vivek Mishra mishra.v...@gmail.com

 1 question.
 user_cook_id, user_facebook_id, user_cell_phone, user_personal_id :
 Combination key of all will be unique?  Or all of them are unique
 individually.?

 If a combination can be unique then a having extra column(index enabled)
 per row  should work for you.

 -Vivek



 On Thu, Sep 27, 2012 at 7:22 PM, Andre Tavares andre...@gmail.com wrote:


 Hi community,

 I have a question: I need to do a search on a CF that has over 200
 million rows to find an User key.

 To find the user, I have 4 keys (acctualy I have 4 keys but it that can
 increase) that are: user_cook_id, user_facebook_id, user_cell_phone,
 user_personal_id

 If I don't find the User by the informed key I need perform another query
 passing the others existing keys to find the user.

 My doubt:What is the better design to mine CF to find the user over the 4
 keys?   I thought to create an CF with secondary index  like this:

 create column family users_test with comparator=UTF8Type and
 column_metadata=[
 {column_name: user_cook_id, validation_class: UTF8Type, index_type: KEYS},
 {column_name: user_facebook_id, validation_class: UTF8Type, index_type:
 KEYS},
 {column_name: user_cell_phone, validation_class: UTF8Type, index_type:
 KEYS},
 {column_name: user_personal_id, validation_class: UTF8Type, index_type:
 KEYS},
 {column_name: user_key, validation_class: UTF8Type, index_type: KEYS}
 ];

 Another approaching is creating just one column for the User CF having
 generic KEY

 create column family users_test with comparator=UTF8Type and
 column_metadata=[
 {column_name: generic_key, validation_class: UTF8Type, index_type: KEYS},
 {column_name: user_key, validation_class: UTF8Type, index_type: KEYS}
 ];

 where generic_id can be: user_cook_id value, or a user_facebook_id,
 user_cell_phone, user_personal_id values ... the problem of this solution
 is that I have 200 million users_id x 4 keys (user_cook_id,
 user_facebook_id, user_cell_phone, user_personal_id) = 800 million rows

 I ask to my friends if am I on the right way or suggestions are well come
 .. thanks





Re: 1000's of column families

2012-09-27 Thread Marcelo Elias Del Valle
Dean,

 In the relational world, I was used to using Hibernate and O/R mapping.
There were times when I used 3 classes (2 inheriting from 1 another) and
mapped all of them to 1 table. The common part was in the superclass and
each subclass had its own columns. The table, however, used to have all
the columns, and this design was hard because of that, as creating more
subclasses would need changes in the table.
 However, if you use playOrm and if playOrm has/had a feature to allow
inheritance mapping to a CF, it would solve your problem, wouldn't it? Of
course it is probably much harder than it might appear... :D

Best regards,
Marcelo Valle.

2012/9/27 Hiller, Dean dean.hil...@nrel.gov

 We have 1000's of different building devices and we stream data from these
 devices.  The format and data from each one varies so one device has
 temperature at timeX with some other variables, another device has CO2
 percentage and other variables.  Every device is unique and streams it's
 own data.  We dynamically discover devices and register them.  Basically,
 one CF or table per thing really makes sense in this environment.  While we
 could try to find out which devices are similar, this would really be a
 pain and some devices add some new variable into the equation.  NOT only
 that but researchers can register new datasets and upload them as well and
 each dataset they have they do NOT want to share with other researches
 necessarily so we have security groups and each CF belongs to security
 groups.  We dynamically create CF's on the fly as people register new
 datasets.

 On top of that, when the data sets get too large, we probably want to
 partition a single CF into time partitions.  We could create one CF and put
 all the data and have a partition per device, but then a time partition
 will contain multiple devices of data meaning we need to shrink our time
 partition size where if we have CF per device, the time partition can be
 larger as it is only for that one device.

 THEN, on top of that, we have a meta CF for these devices so some people
 want to query for streams that match criteria AND which returns a CF name
 and they query that CF name so we almost need a query with variables like
 select cfName from Meta where x = y and then select * from cfName where
 x. Which we can do today.

 Dean

 From: Marcelo Elias Del Valle mvall...@gmail.com
 Reply-To: user@cassandra.apache.org
 Date: Thursday, September 27, 2012 8:01 AM
 To: user@cassandra.apache.org
 Subject: Re: 1000's of column families

 Out of curiosity, is it really necessary to have that amount of CFs?
 I am probably still used to relational databases, where you would use a
 new table just in case you need to store different kinds of data. As
 Cassandra stores anything in each CF, it might probably make sense to have
 a lot of CFs to store your data...
 But why wouldn't you use a single CF with partitions in these case?
 Wouldn't it be the same thing? I am asking because I might learn a new
 modeling technique with the answer.

 []s

 2012/9/26 Hiller, Dean dean.hil...@nrel.gov
 We are streaming data with 1 stream per 1 CF and we have 1000's of CF.
  When using the tools they are all geared to analyzing ONE column family at
 a time :(.  If I remember correctly, Cassandra supports as many CF's as you
 want, correct?  Even though I am going to have tons of funs with
 limitations on the tools, correct?

 (I may end up wrapping the node tool with my own aggregate calls if needed
 to sum up multiple column families and such).

 Thanks,
 Dean



 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr




-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re:

2012-09-27 Thread Vivek Mishra
So it means that, going the secondary index way, you can still hold a unique
combination key per row. If any of these keys is not present, it will
not be part of that combination key, and every time you will get a unique
value for each row. That can definitely avoid duplicate rows.

Or you can even make that combination key the row key as well.

That is one of the alternatives; still, I think there are other ways that can
be worked out.


-Vivek

On Thu, Sep 27, 2012 at 7:51 PM, Andre Tavares andre...@gmail.com wrote:

 user_cook_id, user_facebook_id, user_cell_phone, user_personal_id :
 Combination key of all will be unique?  Or all of them are unique
 individually.?

 Combination key of all will be unique?

 no ...



 Or all of them are unique individually.?
  yes ... all them are unique individually



 2012/9/27 Vivek Mishra mishra.v...@gmail.com

 1 question.
 user_cook_id, user_facebook_id, user_cell_phone, user_personal_id :
 Combination key of all will be unique?  Or all of them are unique
 individually.?

 If a combination can be unique then a having extra column(index enabled)
 per row  should work for you.

 -Vivek



 On Thu, Sep 27, 2012 at 7:22 PM, Andre Tavares andre...@gmail.com wrote:


 Hi community,

 I have a question: I need to do a search on a CF that has over 200
 million rows to find an User key.

 To find the user, I have 4 keys (acctualy I have 4 keys but it that can
 increase) that are: user_cook_id, user_facebook_id, user_cell_phone,
 user_personal_id

 If I don't find the User by the informed key I need perform another
 query passing the others existing keys to find the user.

 My doubt:What is the better design to mine CF to find the user over the
 4 keys?   I thought to create an CF with secondary index  like this:

 create column family users_test with comparator=UTF8Type and
 column_metadata=[
 {column_name: user_cook_id, validation_class: UTF8Type, index_type:
 KEYS},
 {column_name: user_facebook_id, validation_class: UTF8Type, index_type:
 KEYS},
 {column_name: user_cell_phone, validation_class: UTF8Type, index_type:
 KEYS},
 {column_name: user_personal_id, validation_class: UTF8Type, index_type:
 KEYS},
 {column_name: user_key, validation_class: UTF8Type, index_type: KEYS}
 ];

 Another approaching is creating just one column for the User CF having
 generic KEY

 create column family users_test with comparator=UTF8Type and
 column_metadata=[
 {column_name: generic_key, validation_class: UTF8Type, index_type: KEYS},
 {column_name: user_key, validation_class: UTF8Type, index_type: KEYS}
 ];

 where generic_id can be: user_cook_id value, or a user_facebook_id,
 user_cell_phone, user_personal_id values ... the problem of this solution
 is that I have 200 million users_id x 4 keys (user_cook_id,
 user_facebook_id, user_cell_phone, user_personal_id) = 800 million rows

 I ask to my friends if am I on the right way or suggestions are well
 come .. thanks






Re: 1000's of column families

2012-09-27 Thread Hiller, Dean
PlayOrm DOES support inheritance mapping but only supports single table right 
now.  In fact, DboColumnMeta.java  has 4 subclasses that all map to that one 
ColumnFamily so we already support and heavily use the inheritance feature.

That said, I am more concerned with scalability.  The more you stuff into a
table, the more partitions you need... As an example, I really have a choice:

Have this in a partition
device1 datapoint1
device2 datapoint1
device1 datapoint2
device2 datapoint2
device1 datapoint3

OR have just this in a partition
device1 datapoint1
device1 datapoint2
device1 datapoint3

If I use the latter approach, I can have more points for device1 in one 
partition.  I could use inheritance but then I can't fit as many data points 
for device 1 in a partition.

Does that make more sense?

Later,
Dean
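A rough sketch of the time-partitioning idea above: one CF per device (as in the
thread), with the row key being a time bucket so each partition row only holds
that device's points for one window. The bucket size, keyspace and CF names are
assumptions for illustration:

import time
import pycassa

pool = pycassa.ConnectionPool('my_keyspace')            # assumed keyspace
points = pycassa.ColumnFamily(pool, 'device1_points')   # assumed per-device CF, LongType comparator

BUCKET_SECONDS = 30 * 24 * 3600   # e.g. roughly monthly partitions

def partition_key(ts):
    return str(int(ts) // BUCKET_SECONDS)

def write_point(ts, value):
    points.insert(partition_key(ts), {int(ts): str(value)})

def read_range(start_ts, end_ts):
    # walk the buckets covering the requested window; the column slice uses
    # absolute timestamps, so it clips correctly inside each bucket row
    out = []
    for bucket in range(int(start_ts) // BUCKET_SECONDS, int(end_ts) // BUCKET_SECONDS + 1):
        try:
            row = points.get(str(bucket), column_start=int(start_ts), column_finish=int(end_ts))
            out.extend(row.items())
        except pycassa.NotFoundException:
            pass
    return out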


From: Marcelo Elias Del Valle mvall...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Thursday, September 27, 2012 8:45 AM
To: user@cassandra.apache.org
Subject: Re: 1000's of column families

Dean,

 I was used, in the relational world, to use hibernate and O/R mapping. 
There were times when I used 3 classes (2 inheriting from 1 another) and mapped 
all of the to 1 table. The common part was in the super class and each sub 
class had it's own columns. The table, however, use to have all the columns and 
this design was hard because of that, as creating more subclasses would need 
changes in the table.
 However, if you use playOrm and if playOrm has/had a feature to allow 
inheritance mapping to a CF, it would solve your problem, wouldn't it? Of 
course it is probably much harder than it might problably appear... :D

Best regards,
Marcelo Valle.

2012/9/27 Hiller, Dean dean.hil...@nrel.gov
We have 1000's of different building devices and we stream data from these 
devices.  The format and data from each one varies so one device has 
temperature at timeX with some other variables, another device has CO2 
percentage and other variables.  Every device is unique and streams it's own 
data.  We dynamically discover devices and register them.  Basically, one CF or 
table per thing really makes sense in this environment.  While we could try to 
find out which devices are similar, this would really be a pain and some 
devices add some new variable into the equation.  NOT only that but researchers 
can register new datasets and upload them as well and each dataset they have 
they do NOT want to share with other researches necessarily so we have security 
groups and each CF belongs to security groups.  We dynamically create CF's on 
the fly as people register new datasets.

On top of that, when the data sets get too large, we probably want to partition 
a single CF into time partitions.  We could create one CF and put all the data 
and have a partition per device, but then a time partition will contain 
multiple devices of data meaning we need to shrink our time partition size 
where if we have CF per device, the time partition can be larger as it is only 
for that one device.

THEN, on top of that, we have a meta CF for these devices so some people want 
to query for streams that match criteria AND which returns a CF name and they 
query that CF name so we almost need a query with variables like select cfName 
from Meta where x = y and then select * from cfName where x. Which we can 
do today.

Dean

From: Marcelo Elias Del Valle mvall...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Thursday, September 27, 2012 8:01 AM
To: user@cassandra.apache.org
Subject: Re: 1000's of column families

Out of curiosity, is it really necessary to have that amount of CFs?
I am probably still used to relational databases, where you would use a new 
table just in case you need to store different kinds of data. As Cassandra 
stores anything in each CF, it might probably make sense to have a lot of CFs 
to store your data...
But why wouldn't you use a single CF with partitions in these case? Wouldn't it 
be the same thing? I am asking because I might learn a new modeling technique 
with the answer.

[]s

2012/9/26 Hiller, Dean dean.hil...@nrel.gov
We are streaming data with 1 stream per 

Re: Why data tripled in size after repair?

2012-09-27 Thread Sylvain Lebresne
 I don't understand why it copied data twice. In worst case scenario it
 should copy everything (~90G)

Sadly no, repair is currently peer-to-peer based (there is a ticket to
fix it: https://issues.apache.org/jira/browse/CASSANDRA-3200, but
that's not trivial). This means that you can end up with RF times the
data after a repair. Obviously that would be a worst-case scenario, as
it implies everything is repaired, but at least the triplication part is
a problem, albeit a known and not-so-easy-to-fix one.

Is it possible that each time you've run repair, one of the nodes in
the cluster was very out of sync with the other nodes? Maybe a node
that had crashed for a long time?

--
Sylvain


Re: Why data tripled in size after repair?

2012-09-27 Thread Andrey Ilinykh
On Thu, Sep 27, 2012 at 9:52 AM, Sylvain Lebresne sylv...@datastax.com wrote:
 I don't understand why it copied data twice. In worst case scenario it
 should copy everything (~90G)

 Sadly no, repair is currently peer-to-peer based (there is a ticket to
 fix it: https://issues.apache.org/jira/browse/CASSANDRA-3200, but
 that's not trivial). This mean that you can end up with RF times the
 data after a repair. Obviously that should be a worst case scenario as
 it implies everything is repaired, but at least the triplicate part is
 a problem, but a know and not so easy to fix one.

I see. That explains why I get 85G + 85G instead of 90G. But after the next
repair I have six extra files of 75G each;
how is that possible? It looks like repair is done per sstable, not per CF.
Is that possible?


 Is it possible that each time you've run repair, one of the nodes in
 the cluster was very out of sync with the other nodes? Maybe a node
 that had crashed for a long time?

No, nodes go down from time to time (OOM), but I restart them
automatically. But my specific situation is this: I have an order-preserving
partitioner and I update every 5th or 10th row intensively.
As far as I understand, because of that, when the Merkle tree is
calculated, in every range I have several hot rows.  These rows are
good candidates to be inconsistent. There is one thing I don't
understand. Does the Merkle tree calculation use sstables
flushed to disk, or does it use memtables as well?
Let's say I have a hot row which sits in memory on one node but has been
flushed out on another. Is there any difference in the Merkle trees?

Thank you,
  Andrey


Re:

2012-09-27 Thread Vivek Mishra
Yes.

On Thu, Sep 27, 2012 at 10:25 PM, Marcelo Elias Del Valle 
mvall...@gmail.com wrote:



 2012/9/27 Vivek Mishra mishra.v...@gmail.com

 So it means going by secondary index way,


 Out of curiosity, how would you index it in this case? 1 row key for each
 combination, with no fields in the combination repeating...


 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr



Re: 1000's of column families

2012-09-27 Thread Edward Capriolo
Hector also offers support for 'Virtual Keyspaces' which you might
want to look at.


On Thu, Sep 27, 2012 at 1:10 PM, Aaron Turner synfina...@gmail.com wrote:
 On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean dean.hil...@nrel.gov wrote:
 We have 1000's of different building devices and we stream data from these 
 devices.  The format and data from each one varies so one device has 
 temperature at timeX with some other variables, another device has CO2 
 percentage and other variables.  Every device is unique and streams it's own 
 data.  We dynamically discover devices and register them.  Basically, one CF 
 or table per thing really makes sense in this environment.  While we could 
 try to find out which devices are similar, this would really be a pain and 
 some devices add some new variable into the equation.  NOT only that but 
 researchers can register new datasets and upload them as well and each 
 dataset they have they do NOT want to share with other researches 
 necessarily so we have security groups and each CF belongs to security 
 groups.  We dynamically create CF's on the fly as people register new 
 datasets.

 On top of that, when the data sets get too large, we probably want to 
 partition a single CF into time partitions.  We could create one CF and put 
 all the data and have a partition per device, but then a time partition will 
 contain multiple devices of data meaning we need to shrink our time 
 partition size where if we have CF per device, the time partition can be 
 larger as it is only for that one device.

 THEN, on top of that, we have a meta CF for these devices so some people 
 want to query for streams that match criteria AND which returns a CF name 
 and they query that CF name so we almost need a query with variables like 
 select cfName from Meta where x = y and then select * from cfName where 
 x. Which we can do today.

 How strict are your security requirements?  If it wasn't for that,
 you'd be much better off storing data on a per-statistic basis than
 per-device.  Hell, you could store everything in a single CF by using
 a composite row key:

 devicename|stat type|instance

 But yeah, there isn't a hard limit for the number of CF's, but there
 is overhead associated with each one and so I wouldn't consider your
 design as scalable.  Generally speaking, hundreds are ok, but
 thousands is pushing it.



 --
 Aaron Turner
 http://synfin.net/ Twitter: @synfinatic
 http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix  
 Windows
 Those who would give up essential Liberty, to purchase a little temporary
 Safety, deserve neither Liberty nor Safety.
 -- Benjamin Franklin
 carpe diem quam minimum credula postero
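A small sketch of the single-CF layout suggested above, with the row key built
from devicename|stat type|instance and columns named by timestamp (the keyspace,
CF name and the LongType column comparator are assumptions made for this
illustration):

import time
import pycassa

pool = pycassa.ConnectionPool('my_keyspace')            # assumed keyspace
stats = pycassa.ColumnFamily(pool, 'device_stats')      # assumed CF with LongType comparator

def row_key(device, stat_type, instance):
    return '|'.join((device, stat_type, instance))

def record(device, stat_type, instance, value, ts=None):
    # one wide row per (device, statistic, instance); columns are timestamped values
    ts = int(ts if ts is not None else time.time())
    stats.insert(row_key(device, stat_type, instance), {ts: str(value)})

# a time-range read is then just a column slice on that row
record('device1', 'temperature', '0', 21.5)
last_hour = stats.get(row_key('device1', 'temperature', '0'),
                      column_start=int(time.time()) - 3600,
                      column_finish=int(time.time()))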


Re: Why data tripled in size after repair?

2012-09-27 Thread Sylvain Lebresne
 I see. It explains why I get 85G + 85G instead of 90G. But after next
 repair I have six extra files 75G each,
 how is it possible?

Maybe you've run repair on other nodes? Basically repair is a fairly
blind process. If it considers that a given range (and by range I mean
here the ones that repair hashes, so they're supposed to be small
ranges) is inconsistent between two peers, it doesn't know which peer
is up to date (and in fact it cannot know; it is possible that no node is
more up to date than the other on the whole range, even if said range
is small, at least in theory).

 It looks like repair is done per sstable, not per CF. Is that possible?

No, repair is done on the CF, not on individual sstables.


 Does the Merkle tree calculation use sstables flushed to disk, or does it use
 memtables as well?

It triggers a flush, waits for it, and then uses the on-disk sstables.
So in theory, since the order to flush is sent to each replica at the
same time, all the replicas will trigger the flush within a very short
interval, and thus the data sets they consider differ only by what was
written in that short interval. So in a cluster with a high write load, it is
expected that a bit of the inconsistency is due to that short interval, but it
should be relatively negligible. That's the theory at least. But
clearly your case (ordered partitioner with intensive updates covering
the whole range) does maximize that inconsistency. Still, it shouldn't
be that dramatic.

However your use of the ordered partitioner might not be
insignificant, as it's much less used and repair does have a few
specific bits for it. Do you mind opening a ticket on JIRA with a
summary of your configuration/problem? I'll look to see if I can spot
something wrong relating to the ordered partitioner; in any case that'll
make it simpler to track what's wrong there.

--
Sylvain


Re: 1000's of column families

2012-09-27 Thread Aaron Turner
On Thu, Sep 27, 2012 at 7:35 PM, Marcelo Elias Del Valle
mvall...@gmail.com wrote:


 2012/9/27 Aaron Turner synfina...@gmail.com

 How strict are your security requirements?  If it wasn't for that,
 you'd be much better off storing data on a per-statistic basis then
 per-device.  Hell, you could store everything in a single CF by using
 a composite row key:

 devicename|stat type|instance

 But yeah, there isn't a hard limit for the number of CF's, but there
 is overhead associated with each one and so I wouldn't consider your
 design as scalable.  Generally speaking, hundreds are ok, but
 thousands is pushing it.


 Aaron,

 Imagine that instead of using a composite key in this case, you use a
 simple row key instance_uuid. Then, to index data by devicename |
 start_type|instance you use another CF with this composite key or several
 CFs to index it.
 Do you see any drawbacks in terms of performance?

Really that depends on the client side, I think.  Ideally, you'd like
the client to be able to directly access the row by name without looking
it up in some index.  Basically, if you have to look up the instance_uuid,
that's another call to some datastore, which takes more time than
generating the row key from its composites.  At least that's my opinion...

Of course there are times where using an instance_uuid makes a lot of
sense... like if you rename a device and want all your stats to move
to the new name.  Much easier to just update the mapping record than
to read and rewrite all your rows for that device.

In my project, we use a device_uuid (just a primary key stored in an
Oracle DB... long story!), but everything else is by name in our
composite row keys.


-- 
Aaron Turner
http://synfin.net/ Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix  Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
-- Benjamin Franklin
carpe diem quam minimum credula postero


Re: 1000's of column families

2012-09-27 Thread Hiller, Dean
Unfortunately, the security aspect is very strict.  Some make their data
public but there are many projects where due to client contracts, they
cannot make their data public within our company(ie. Other groups in our
company are not allowed to see the data).

Also, currently, we have researchers upload their own datasets as well.
Ideally, they want to see this one NoSQL store as the place where all data
for the company lives... ALL of it, so if you add up all the applications
(which would be huge) and then all the tables (which is large), it just keeps
growing.  It is a very nice concept (all data in one location), though we
will see how implementing it goes.

How much overhead per column family in RAM?  So far we have around 4000
CFs with no issue that I can see yet.

Dean

On 9/27/12 11:10 AM, Aaron Turner synfina...@gmail.com wrote:

On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean dean.hil...@nrel.gov
wrote:
 We have 1000's of different building devices and we stream data from
these devices.  The format and data from each one varies so one device
has temperature at timeX with some other variables, another device has
CO2 percentage and other variables.  Every device is unique and streams
it's own data.  We dynamically discover devices and register them.
Basically, one CF or table per thing really makes sense in this
environment.  While we could try to find out which devices are
similar, this would really be a pain and some devices add some new
variable into the equation.  NOT only that but researchers can register
new datasets and upload them as well and each dataset they have they do
NOT want to share with other researches necessarily so we have security
groups and each CF belongs to security groups.  We dynamically create
CF's on the fly as people register new datasets.

 On top of that, when the data sets get too large, we probably want to
partition a single CF into time partitions.  We could create one CF and
put all the data and have a partition per device, but then a time
partition will contain multiple devices of data meaning we need to
shrink our time partition size where if we have CF per device, the time
partition can be larger as it is only for that one device.

 THEN, on top of that, we have a meta CF for these devices so some
people want to query for streams that match criteria AND which returns a
CF name and they query that CF name so we almost need a query with
variables like select cfName from Meta where x = y and then select *
from cfName where x. Which we can do today.

How strict are your security requirements?  If it wasn't for that,
you'd be much better off storing data on a per-statistic basis then
per-device.  Hell, you could store everything in a single CF by using
a composite row key:

devicename|stat type|instance

But yeah, there isn't a hard limit for the number of CF's, but there
is overhead associated with each one and so I wouldn't consider your
design as scalable.  Generally speaking, hundreds are ok, but
thousands is pushing it.



-- 
Aaron Turner
http://synfin.net/ Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix 
Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
-- Benjamin Franklin
carpe diem quam minimum credula postero



Re: pig and widerows

2012-09-27 Thread William Oberman
The next painful lesson for me was figuring out how to get logging working
for a distributed hadoop process.   In my test environment, I have a single
node that runs name/secondaryname/data/job trackers (call it central),
and I have two cassandra nodes running tasktrackers.  But, I also have
cassandra libraries on the central box, and invoke my pig script from
there.   I had been patching and recompiling cassandra (1.1.5 with my
logging, and the system env fix) on that central box, and SOME of the
logging was appearing in the pig output.  But, eventually I decided to move
that recompiled code to the tasktracker boxes, and then I found even more
of the logging I had added in:
/var/log/hadoop/userlogs/JOB_ID
on each of the tasktrackers.

Based on this new logging, I found out that the widerows setting wasn't
propagating from the central box to the tasktrackers.  I added:
export PIG_WIDEROW_INPUT=true
To hadoop-env.sh on each of the tasktrackers and it finally worked!

So, long story short, to actually get all columns for a key I had to:
1.) patch 1.1.5 to honor the PIG_WIDEROW_INPUT=true system setting
2.) add the system setting to ALL nodes in the hadoop cluster

I'm going to try to undo all of my other hacks to get logging/printing
working to confirm if those were actually the only two changes I had to
make.

will

On Thu, Sep 27, 2012 at 1:43 PM, William Oberman
ober...@civicscience.com wrote:

 Ok, this is painful.  The first problem I found is in stock 1.1.5 there is
 no way to set widerows to true!  The new widerows URI parsing is NOT in
 1.1.5.  And for extra fun, getting the value from the system property is
 BROKEN (at least in my centos linux environment).

 Here are the key lines of code (in CassandraStorage), note the different
 ways of getting the property!  getenv in the test, and getProperty in the
 set:
  widerows = DEFAULT_WIDEROW_INPUT;
  if (System.getenv(PIG_WIDEROW_INPUT) != null)
      widerows = Boolean.valueOf(System.getProperty(PIG_WIDEROW_INPUT));

 I added this logging:
  logger.warn("widerows = " + widerows + " getenv=" + System.getenv(PIG_WIDEROW_INPUT)
              + " getProp=" + System.getProperty(PIG_WIDEROW_INPUT));

 And I saw:
 org.apache.cassandra.hadoop.pig.CassandraStorage - widerows = false
 getenv=true getProp=null
 So for me getProperty != getenv :-(
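
A minimal sketch of the patch this implies, assuming the intent is for the
PIG_WIDEROW_INPUT environment variable to drive both the check and the
assignment (field names as in the snippet above):

widerows = DEFAULT_WIDEROW_INPUT;
String widerowsEnv = System.getenv(PIG_WIDEROW_INPUT);   // read the env var once
if (widerowsEnv != null)
    widerows = Boolean.valueOf(widerowsEnv);             // ...and use that same value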

 For people trying to figure out how to debug cassandra + hadoop + pig, for
 me the key to get debugging and logging working was to focus on
 /etc/hadoop/conf (not /etc/pig/conf as I expected).

 Also, if you want to compile your own cassandra (to add logging messages),
 make sure it appears first on the pig classpath (use pig -secretDebugCmd
 to see the fully qualified command line).

 The next thing I'm trying to figure out is why when widerows == true I'm
 STILL not seeing more than 1024 columns :-(

 will

 On Wed, Sep 26, 2012 at 3:42 PM, William Oberman ober...@civicscience.com
  wrote:

 Hi,

 I'm trying to figure out what's going on with my cassandra/hadoop/pig
 system.  I created a mini copy of my main cassandra data by randomly
 subsampling to get ~50,000 keys.  I was then writing pig scripts but also
 the equivalent operation using simple single threaded code to double check
 pig.

 Of course my very first test failed.  After doing a pig DUMP on the raw
 data, what appears to be happening is I'm only getting the first 1024
 columns of a key.  After some googling, this seems to be known behavior
 unless you add ?widerows=true to the pig load URI. I tried this, but
 it didn't seem to fix anything :-(   Here's the start of my pig script:
 foo = LOAD 'cassandra://KEYSPACE/COLUMN_FAMILY?widerows=true' USING
 CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name,
 value)});

 I'm using cassandra 1.1.5 from datastax rpms.  I'm using hadoop
 (0.20.2+923.418-1) and pig (0.8.1+28.39-1) from cloudera rpms.

 What am I doing wrong?  Or, how can I enable debugging/logging to figure
 out what is going on next?  I haven't had to debug hadoop+pig+cassandra
 much, other than doing DUMP/ILLUSTRATE from pig.

 will






Re: pig and widerows

2012-09-27 Thread William Oberman
I don't want to switch my cassandra to HEAD, but looking at the newest code
for CassandraStorage, I'm concerned the URI parsing for widerows isn't
going to work.  setLocation first calls setLocationFromUri (which sets
widerows to the URI value), but then it resets widerows to a static value
(which is defined as false), and finally sets widerows to the system setting
if it exists.  That doesn't seem right...  ?
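
For what it's worth, the precedence I'd expect is: default, then the
system/environment setting, then an explicit widerows in the LOAD URI.
Purely a sketch of that intent, not the actual CassandraStorage code
("urlQuery" below stands in for whatever map the parsed query string ends
up in):

boolean widerows = DEFAULT_WIDEROW_INPUT;        // 1. hard-coded default
String env = System.getenv(PIG_WIDEROW_INPUT);
if (env != null)
    widerows = Boolean.valueOf(env);             // 2. cluster-wide override
String uriValue = urlQuery.get("widerows");
if (uriValue != null)
    widerows = Boolean.valueOf(uriValue);        // 3. per-LOAD override wins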

But setLocationFromUri also gets called from setStoreLocation, and I don't
really know the difference between setLocation and setStoreLocation in
terms of how the cassandra/pig/hadoop integration fits together.

will

On Thu, Sep 27, 2012 at 3:26 PM, William Oberman
ober...@civicscience.comwrote:

 The next painful lesson for me was figuring out how to get logging working
 for a distributed hadoop process.   In my test environment, I have a single
 node that runs name/secondaryname/data/job trackers (call it central),
 and I have two cassandra nodes running tasktrackers.  But, I also have
 cassandra libraries on the central box, and invoke my pig script from
 there.   I had been patching and recompiling cassandra (1.1.5 with my
 logging, and the system env fix) on that central box, and SOME of the
 logging was appearing in the pig output.  But, eventually I decided to move
 that recompiled code to the tasktracker boxes, and then I found even more
 of the logging I had added in:
 /var/log/hadoop/userlogs/JOB_ID
 on each of the tasktrackers.

 Based on this new logging, I found out that the widerows setting wasn't
 propagating from the central box to the tasktrackers.  I added:
 export PIG_WIDEROW_INPUT=true
 To hadoop-env.sh on each of the tasktrackers and it finally worked!

 So, long story short, to actually get all columns for a key I had to:
 1.) patch 1.1.5 to honor the PIG_WIDEROW_INPUT=true system setting
 2.) add the system setting to ALL nodes in the hadoop cluster

 I'm going to try to undo all of my other hacks to get logging/printing
 working to confirm if those were actually the only two changes I had to
 make.

 will


 On Thu, Sep 27, 2012 at 1:43 PM, William Oberman ober...@civicscience.com
  wrote:

 Ok, this is painful.  The first problem I found is in stock 1.1.5 there
 is no way to set widerows to true!  The new widerows URI parsing is NOT in
 1.1.5.  And for extra fun, getting the value from the system property is
 BROKEN (at least in my centos linux environment).

 Here are the key lines of code (in CassandraStorage), note the different
 ways of getting the property!  getenv in the test, and getProperty in the
 set:
 widerows = DEFAULT_WIDEROW_INPUT;
 if (System.getenv(PIG_WIDEROW_INPUT) != null)
 widerows =
 Boolean.valueOf(System.getProperty(PIG_WIDEROW_INPUT));

 I added this logging:
 logger.warn("widerows = " + widerows + " getenv=" +
 System.getenv(PIG_WIDEROW_INPUT) +
 " getProp=" + System.getProperty(PIG_WIDEROW_INPUT));

 And I saw:
 org.apache.cassandra.hadoop.pig.CassandraStorage - widerows = false
 getenv=true getProp=null
 So for me getProperty != getenv :-(

 For people trying to figure out how to debug cassandra + hadoop + pig,
 for me the key to get debugging and logging working was to focus on
 /etc/hadoop/conf (not /etc/pig/conf as I expected).

 Also, if you want to compile your own cassandra (to add logging
 messages), make sure it appears first on the pig classpath (use pig
 -secretDebugCmd to see the fully qualified command line).

 The next thing I'm trying to figure out is why when widerows == true I'm
 STILL not seeing more than 1024 columns :-(

 will

 On Wed, Sep 26, 2012 at 3:42 PM, William Oberman 
 ober...@civicscience.com wrote:

 Hi,

 I'm trying to figure out what's going on with my cassandra/hadoop/pig
 system.  I created a mini copy of my main cassandra data by randomly
 subsampling to get ~50,000 keys.  I was then writing pig scripts but also
 the equivalent operation using simple single threaded code to double check
 pig.

 Of course my very first test failed.  After doing a pig DUMP on the raw
 data, what appears to be happening is I'm only getting the first 1024
 columns of a key.  After some googling, this seems to be known behavior
 unless you add ?widerows=true to the pig load URI. I tried this, but
 it didn't seem to fix anything :-(   Here's the start of my pig script:
 foo = LOAD 'cassandra://KEYSPACE/COLUMN_FAMILY?widerows=true' USING
 CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name,
 value)});

 I'm using cassandra 1.1.5 from datastax rpms.  I'm using hadoop
 (0.20.2+923.418-1) and pig (0.8.1+28.39-1) from cloudera rpms.

 What am I doing wrong?  Or, how can I enable debugging/logging to figure
 out what is going on next?  I haven't had to debug hadoop+pig+cassandra
 much, other than doing DUMP/ILLUSTRATE from pig.

 will






Re: 1.1.5 Missing Insert! Strange Problem

2012-09-27 Thread Arya Goudarzi
Thanks for your reply. I did grep on the commit logs for the offending key
and grep showed "Binary file matches". I am trying to use this tool to
extract the commitlog and actually confirm if the mutation was a write:

https://github.com/carloscm/cassandra-commitlog-extract.git


On Thu, Sep 27, 2012 at 1:45 AM, Sylvain Lebresne sylv...@datastax.comwrote:

  I can verify the existence of the key that was inserted in Commitlogs of
 both replicas, however it seems that this record was never inserted.

 Out of curiosity, how can you verify that?

 --
 Sylvain



Re: Once again, super columns or composites?

2012-09-27 Thread Edward Kibardin
Oh... Sylvain, thanks a lot for such a complete answer.

Yeah, I understand my mistake in my suggestions regarding composites.
It seems composites are pretty much an advanced version of manually
joining keys into a string column name: key1:key2
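
Roughly, yes. A quick sketch of the difference (the names and types below
are just for illustration):

// Manual join: the two sub-keys collapse into one opaque string column name.
String day  = "2012-09-27";            // hypothetical first part
String stat = "pageviews";             // hypothetical second part
String manualName = day + ":" + stat;  // sorts as one big string; ":" can collide

// With a CompositeType(UTF8Type, UTF8Type) comparator the column name still
// carries both parts, but each component keeps its own type and sort order,
// and a slice query can match on a prefix (e.g. every column whose first
// component equals day).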

Thanks a lot!
Ed

On Thu, Sep 27, 2012 at 2:02 PM, Sylvain Lebresne sylv...@datastax.comwrote:

  But from my understanding, you just can't update composite column, only
  delete and insert... so this may make my update use case much more
  complicated.

 Let me try to sum things up.
 In regular column families, a column (value) is defined by 2 keys: the
 row key and the column name.
 In super column families, a column (value) is defined by 3 keys: the
 row key, the super column name and the column name.

 So a super column is really just the set of columns that share the
 same (row key, super column name) pair.

 The idea of composite columns is to use regular columns, but simply to
 distinguish multiple parts of the column name. So now if you take the
 example of a CompositeType with 2 components. In that column family:
 a column (value) is defined by 3 keys: the row key, the first
 component of the column name and the second component of the column
 name.

 In other words, composites are a *generalization* of super columns and
 super columns are the case of composites with 2 components. Except
 that super columns are hard-wired in the cassandra code base in a way
 that come with a number of limitation, the main one being that we
 always deserialize a super column (again, which is just a set of
 columns) in its entirety when we read it from disk.

 So no, it's not true that "you just can't update composite column,
 only delete and insert" nor that it is not possible to add any
 sub-column to your composite.

 That being said, if you are using the thrift interface, super columns
 do have a few perks currently:
   - the grouping of all the sub-columns composing a super column is
 hard-wired in Cassandra. The equivalent for composites, which consists
 in grouping all columns having the same value for a given component,
 must be done client side. Maybe some client libraries do that for you
 but I'm not sure (I don't know for Pycassa for instance).
   - there are a few queries that can be easily done with super columns
 that don't translate easily to composites, namely deleting whole super
 columns and, to a lesser extent, querying multiple super columns by name.
 That's due to a few limitations that upcoming versions of Cassandra
 will solve but it's not the case with currently released versions.

 The bottom line is: if you can do without those few perks, then you'd
 better use composites since they have less limitations. If you can't
 really do without these perks and can live with the super columns
 limitations, then go on, use super columns. (And if you want the perks
 without the limitations, wait for Cassandra 1.2 and use CQL3 :D)


  ... and as I know, DynamicComposites is not recommended (and actually not
  supported by Pycassa).

 DynamicComposites don't do what you think they do. They do nothing
 more than regular composite as far as comparing them to SuperColumns
 is concerned, except giving you ways to shoot yourself in the foot.

 --
 Sylvain



Re: 1.1.5 Missing Insert! Strange Problem

2012-09-27 Thread Arya Goudarzi
I was restarting Cassandra nodes again today. 1 hour later my support team
let me know that a customer has reported some missing data. I suppose this
is the same issue. The application logs show that our client got success
from the Thrift log and proceeded with responding to the user and I could
grep the commit log for a missing record like I did before.

We have durable writes enabled. To me, it seems like when data is still in
memtables and hasn't been flushed to disk, the commit log doesn't get
replayed correctly when I restart the node.

Please advise.

On Thu, Sep 27, 2012 at 2:43 PM, Arya Goudarzi gouda...@gmail.com wrote:

 Thanks for your reply. I did grep on the commit logs for the offending key
 and grep showed "Binary file matches". I am trying to use this tool to
 extract the commitlog and actually confirm if the mutation was a write:

 https://github.com/carloscm/cassandra-commitlog-extract.git


 On Thu, Sep 27, 2012 at 1:45 AM, Sylvain Lebresne sylv...@datastax.comwrote:

  I can verify the existence of the key that was inserted in Commitlogs
 of both replicas, however it seems that this record was never inserted.

 Out of curiosity, how can you verify that?

 --
 Sylvain





Re: 1.1.5 Missing Insert! Strange Problem

2012-09-27 Thread Arya Goudarzi
rcoli helped me investigate this issue. The mystery was that the commit log
segment was probably not fsynced to disk, since the sync setting was
periodic with a 10 second delay, and the CRC32 checksum validation failed,
skipping the replay, so what happened in my scenario can be explained by
this. I am going to change our settings to batch mode.
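
For anyone else hitting this, the relevant knobs are the commitlog_sync
settings in cassandra.yaml; a sketch of the two modes (the values shown are
the ones I believe the stock yaml ships with, double-check your own file):

# periodic (default): acknowledge writes immediately, fsync every N ms
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

# batch: group writes and fsync before acknowledging them
# commitlog_sync: batch
# commitlog_sync_batch_window_in_ms: 50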

Thank you rcoli for your help.

On Thu, Sep 27, 2012 at 2:49 PM, Arya Goudarzi gouda...@gmail.com wrote:

 I was restarting Cassandra nodes again today. 1 hour later my support team
 let me know that a customer has reported some missing data. I suppose this
 is the same issue. The application logs show that our client got success
 from the Thrift log and proceeded with responding to the user and I could
 grep the commit log for a missing record like I did before.

 We have durable writes enabled. To me, it seems like when data is still in
 memtables and hasn't been flushed to disk, the commit log doesn't get
 replayed correctly when I restart the node.

 Please advise.


 On Thu, Sep 27, 2012 at 2:43 PM, Arya Goudarzi gouda...@gmail.com wrote:

 Thanks for your reply. I did grep on the commit logs for the offending
 key and grep showed "Binary file matches". I am trying to use this tool to
 extract the commitlog and actually confirm if the mutation was a write:

 https://github.com/carloscm/cassandra-commitlog-extract.git


 On Thu, Sep 27, 2012 at 1:45 AM, Sylvain Lebresne 
 sylv...@datastax.comwrote:

  I can verify the existence of the key that was inserted in Commitlogs
 of both replicas, however it seems that this record was never inserted.

 Out of curiosity, how can you verify that?

 --
 Sylvain






Re: 1.1.5 Missing Insert! Strange Problem

2012-09-27 Thread Rob Coli
On Thu, Sep 27, 2012 at 3:25 PM, Arya Goudarzi gouda...@gmail.com wrote:
 rcoli helped me investigate this issue. The mystery was that the segment of
 commit log was probably not fsynced to disk since the setting was set to
 periodic with 10 second delay and CRC32 checksum validation failed skipping
 the replay, so what happened in my scenario can be explained by this. I am
 going to change our settings to batch mode.

To be clear, I conjectured that this behavior is the cause of the
issue. As there is no logging when Cassandra encounters a corrupt log
segment [1] during replay, I was unable to verify this conjecture.

Calling nodetool drain as part of a restart process should [2]
eliminate any chance of unsynced writes being lost, and is likely to
be more performant overall than changing to batch mode.
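
In other words, something along these lines before each restart (the
restart command itself depends on your packaging, so treat it as a
placeholder):

nodetool -h <host> drain       # flush memtables and stop accepting writes
# ...then restart Cassandra via your init script / service manager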

=Rob
[1] I plan to submit a patch for this.
[2] But it doesn't necessarily do so in 1.0.x; see CASSANDRA-4446 ...

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: downgrade from 1.1.4 to 1.0.X

2012-09-27 Thread Rob Coli
On Thu, Sep 27, 2012 at 2:46 AM, Віталій Тимчишин tiv...@gmail.com wrote:
 I suppose the way is to convert all SST to json, then install previous
 version, convert back and load

Only files flushed in the new version will need to be dumped/reloaded.

Files which have not been scrubbed/upgraded (ie, have the 1.0 -h[x]-
version) get renamed to different names in 1.1. You can revert all of
these files back to 1.0 as long as you change their names back to 1.0
style names, which is presumably what your snapshots contain...
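
A rough sketch of the json round trip for one such file (sstable2json and
json2sstable ship in Cassandra's bin/; every path, file name and generation
number below is purely illustrative):

# on the 1.1 install, dump each sstable that was written by 1.1:
sstable2json /var/lib/cassandra/data/MyKS/MyCF/MyKS-MyCF-hd-12-Data.db > MyCF-12.json

# after reinstalling 1.0.x, load it back:
json2sstable -K MyKS -c MyCF MyCF-12.json /var/lib/cassandra/data/MyKS/MyCF-hc-12-Data.db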

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: 1000's of column families

2012-09-27 Thread Robin Verlangen
so if you add up all the applications
which would be huge and then all the tables which is large, it just keeps
growing.  It is a very nice concept(all data in one location), though we
will see how implementing it goes.

This shouldn't be a real problem for Cassandra. Just add more nodes and
every node contains a smaller piece of the cake (~ring).

Best regards,

Robin Verlangen
*Software engineer*
W http://www.robinverlangen.nl
E ro...@us2.nl

http://goo.gl/Lt7BC

Disclaimer: The information contained in this message and attachments is
intended solely for the attention and use of the named addressee and may be
confidential. If you are not the intended recipient, you are reminded that
the information remains the property of the sender. You must not use,
disclose, distribute, copy, print or rely on this e-mail. If you have
received this message in error, please contact the sender immediately and
irrevocably delete this message and any copies.



2012/9/27 Hiller, Dean dean.hil...@nrel.gov

 Unfortunately, the security aspect is very strict.  Some make their data
 public but there are many projects where due to client contracts, they
 cannot make their data public within our company(ie. Other groups in our
 company are not allowed to see the data).

 Also, currently, we have researchers upload their own datasets as well.
 Ideally, they want to see this one noSQL store as the place where all data
 for the company lives... ALL of it so if you add up all the applications
 which would be huge and then all the tables which is large, it just keeps
 growing.  It is a very nice concept (all data in one location), though we
 will see how implementing it goes.

 How much overhead per column family in RAM?  So far we have around 4000
 Cfs with no issue that I see yet.

 Dean

 On 9/27/12 11:10 AM, Aaron Turner synfina...@gmail.com wrote:

 On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean dean.hil...@nrel.gov
 wrote:
  We have 1000's of different building devices and we stream data from
 these devices.  The format and data from each one varies so one device
 has temperature at timeX with some other variables, another device has
 CO2 percentage and other variables.  Every device is unique and streams
 its own data.  We dynamically discover devices and register them.
 Basically, one CF or table per thing really makes sense in this
 environment.  While we could try to find out which devices are
 similar, this would really be a pain and some devices add some new
 variable into the equation.  NOT only that but researchers can register
 new datasets and upload them as well and each dataset they have they do
 NOT want to share with other researches necessarily so we have security
 groups and each CF belongs to security groups.  We dynamically create
 CF's on the fly as people register new datasets.
 
  On top of that, when the data sets get too large, we probably want to
 partition a single CF into time partitions.  We could create one CF and
 put all the data and have a partition per device, but then a time
 partition will contain multiple devices of data meaning we need to
 shrink our time partition size where if we have CF per device, the time
 partition can be larger as it is only for that one device.
 
  THEN, on top of that, we have a meta CF for these devices so some
 people want to query for streams that match criteria AND which returns a
 CF name and they query that CF name so we almost need a query with
 variables like select cfName from Meta where x = y and then select *
 from cfName where x. Which we can do today.
 
 How strict are your security requirements?  If it wasn't for that,
 you'd be much better off storing data on a per-statistic basis than
 per-device.  Hell, you could store everything in a single CF by using
 a composite row key:
 
 devicename|stat type|instance
 
 But yeah, there isn't a hard limit for the number of CF's, but there
 is overhead associated with each one and so I wouldn't consider your
 design as scalable.  Generally speaking, hundreds are ok, but
 thousands is pushing it.
 
 
 
 --
 Aaron Turner
 http://synfin.net/ Twitter: @synfinatic
 http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix &
 Windows
 Those who would give up essential Liberty, to purchase a little temporary
 Safety, deserve neither Liberty nor Safety.
 -- Benjamin Franklin
 carpe diem quam minimum credula postero