fixed size collection possible?

2014-04-22 Thread Jimmy Lin
hi,
looking at the collection type support in CQL3, e.g.
http://www.datastax.com/documentation/cql/3.0/cql/cql_using/use_list_t.html

we can append or remove elements using the + and - operators:

UPDATE users
  SET top_places = top_places + [ 'mordor' ] WHERE user_id = 'frodo';

UPDATE users
  SET top_places = top_places - ['riddermark'] WHERE user_id = 'frodo';


Is there a way to keep the list (collection) at a fixed size?

I was thinking about using TTL to remove older data after a certain
time, but then the list will become too big if the TTL is too long, and
if the TTL is too short I run the risk of having an empty list (if
there is no new activity).

Even if I don't use a collection type and have my own table, I still
run into the same issue.


Any recommendation to handle this type of situation?


thanks


Re: Deleting column names

2014-04-22 Thread Laing, Michael
Referring to the original post, I think the confusion is about what a row
is in this context:

So as far as I understand, the s column is now the *row* key

...

Since I have multiple different p, o, c combinations per s, deleting the whole
 *row* identified by s is not an option


The s column is in fact the *partition key*, not the row key, which is the
composite of all 4 columns (the partition key plus the clustering columns).

Deleting the row, as Steven correctly showed, will not delete the
partition, but only the row: the tuple of the 4 columns.
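The partition/row distinction can be sketched as a toy model (illustrative Python, not driver code, and the names are hypothetical): the partition key maps to many clustering rows, and deleting one full-primary-key row leaves its siblings alone.

```python
# Toy model of a CQL partition: the partition key (s) maps to a dict of
# clustering keys (p, o, c) -> row. Deleting one full-primary-key row
# leaves the other rows in the same partition untouched.
table = {}

def insert(s, p, o, c):
    table.setdefault(s, {})[(p, o, c)] = True

def delete_row(s, p, o, c):
    # Models: DELETE FROM table_name WHERE s=? AND p=? AND o=? AND c=?
    table.get(s, {}).pop((p, o, c), None)

insert("sa", "p1", "o1", "c1")
insert("sa", "p2", "o2", "c2")
delete_row("sa", "p1", "o1", "c1")

assert ("p2", "o2", "c2") in table["sa"]      # sibling row survives
assert ("p1", "o1", "c1") not in table["sa"]  # only the targeted row is gone
```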

Terminology has changed with CQL and we all have to get used to it.

ml


On Mon, Apr 21, 2014 at 10:00 PM, Steven A Robenalt
srobe...@stanford.edu wrote:

 Is there a reason you can't use:

 DELETE FROM table_name WHERE s = ? AND p = ? AND o = ? AND c = ?;


 On Mon, Apr 21, 2014 at 6:51 PM, Eric Plowe eric.pl...@gmail.com wrote:

 Also I don't think you can null out columns that are part of the primary
 key after they've been set.


 On Monday, April 21, 2014, Andreas Wagner 
 andreas.josef.wag...@googlemail.com wrote:

 Hi cassandra users, hi Sebastian,

 I'd be interested in this ... is there any update/solution?

 Thanks so much ;)
 Andreas

 On 04/16/2014 11:43 AM, Sebastian Schmidt wrote:

 Hi,

 I'm using a Cassandra table to store some data. I created the table like
 this:
 CREATE TABLE IF NOT EXISTS table_name (s BLOB, p BLOB, o BLOB, c BLOB,
 PRIMARY KEY (s, p, o, c));

 I need at least the p column to be sorted, so that I can use it in a
 WHERE clause. So as far as I understand, the s column is now the row
 key, and (p, o, c) is the column name.

 I tried to delete single entries with a prepared statement like this:
 DELETE p, o, c FROM table_name WHERE s = ? AND p = ? AND o = ? AND c =
 ?;

 That didn't work, because p is a primary key part. It failed during
 preparation.

 I also tried to use variables like this:
 DELETE ?, ?, ? FROM table_name WHERE s = ?;

 This also failed during preparation, because ? is an unknown identifier.


 Since I have multiple different p, o, c combinations per s, deleting the
 whole row identified by s is not an option. So how can I delete an s, p, o, c
 tuple without deleting other s, p, o, c tuples with the same s? I know
 that this worked with Thrift/Hector before.

 Regards,
 Sebastian





 --
 Steve Robenalt
 Software Architect
 HighWire | Stanford University
 425 Broadway St, Redwood City, CA 94063

 srobe...@stanford.edu
 http://highwire.stanford.edu








Does NetworkTopologyStrategy in Cassandra 2.0 work?

2014-04-22 Thread horschi
Hi,

is it possible that NetworkTopologyStrategy does not work with Cassandra
2.0 any more?

I just updated my dev cluster to 2.0.7 and got UnavailableExceptions for
CQL/Thrift queries on my already existing column families, even though all
(two) nodes were up. Changing to SimpleStrategy fixed the issue.

Also, I cannot switch back to NetworkTopologyStrategy:

[default@unknown] update keyspace MYKS with placement_strategy =
'NetworkTopologyStrategy';
Error constructing replication strategy class

[default@unknown] update keyspace MYKS with placement_strategy =
'org.apache.cassandra.locator.NetworkTopologyStrategy';
Error constructing replication strategy class


This does not seem to be something I encountered with 1.2 before. Can
anyone tell me which one is broken here, Cassandra or myself? :-)

cheers,
Christian


Re: Does NetworkTopologyStrategy in Cassandra 2.0 work?

2014-04-22 Thread horschi
Ok, it seems 2.0 is simply stricter about datacenter names. I just had
to change the datacenter name to match the name in nodetool ring:

update keyspace MYKS with placement_strategy = 'NetworkTopologyStrategy'
and strategy_options = {datacenter1 : 2};

So the schema was wrong, but 1.2 did not care about it.

cheers,
Christian


On Tue, Apr 22, 2014 at 1:51 PM, horschi hors...@gmail.com wrote:

 Hi,

 is it possible that NetworkTopologyStrategy does not work with Cassandra
 2.0 any more?

 I just updated my dev cluster to 2.0.7 and got UnavailableExceptions for
 CQL/Thrift queries on my already existing column families, even though all
 (two) nodes were up. Changing to SimpleStrategy fixed the issue.

 Also, I cannot switch back to NetworkTopologyStrategy:

 [default@unknown] update keyspace MYKS with placement_strategy =
 'NetworkTopologyStrategy';
 Error constructing replication strategy class

 [default@unknown] update keyspace MYKS with placement_strategy =
 'org.apache.cassandra.locator.NetworkTopologyStrategy';
 Error constructing replication strategy class


 This does not seem to be something I encountered with 1.2 before. Can
 anyone tell me which one is broken here, Cassandra or myself? :-)

 cheers,
 Christian



BulkOutputFormat and CQL3

2014-04-22 Thread James Campbell
Hi Cassandra Users-

I have a Hadoop job that uses the pattern in Cassandra 2.0.6's 
hadoop_cql3_word_count example to load data from HDFS into Cassandra.  Having 
read about BulkOutputFormat as a way to potentially significantly increase the 
write throughput from Hadoop to Cassandra, I am considering testing against 
that pattern (http://www.datastax.com/dev/blog/improved-hadoop-output, 
http://shareitexploreit.blogspot.com/2012/03/bulkloadto-cassandra-with-hadoop.html
 ).

Is it possible/supported/recommended to use the BulkOutputFormat to load data 
from Hadoop to a CQL3 table in Cassandra?

I see several examples of building composite keys using Hector (e.g. 
http://www.datastax.com/dev/blog/introduction-to-composite-columns-part-1, 
http://brianoneill.blogspot.com/2012/09/composite-keys-connecting-dots-between.html
 ), but with the changes to support CQL3 having left a lot of different
documentation out there for different versions, it's not clear to me what
the proper way is to build the requisite ByteBuffer, List<Mutation> pairs
that the ColumnFamilyOutputFormat (and thus BulkOutputFormat) needs.

James






Re: Deleting column names

2014-04-22 Thread Sebastian Schmidt
From my understanding, this would delete all entries with the given s.
Meaning, if I have inserted (sa, p1, o1, c1) and (sa, p2, o2, c2),
executing this:

DELETE FROM table_name WHERE s = sa AND p = p1 AND o = o1 AND c = c1

would delete both (sa, p1, o1, c1) and (sa, p2, o2, c2). Is this correct? Or
does the above statement only delete (sa, p1, o1, c1)?


2014-04-22 4:00 GMT+02:00 Steven A Robenalt srobe...@stanford.edu:

 Is there a reason you can't use:

 DELETE FROM table_name WHERE s = ? AND p = ? AND o = ? AND c = ?;


 On Mon, Apr 21, 2014 at 6:51 PM, Eric Plowe eric.pl...@gmail.com wrote:

 Also I don't think you can null out columns that are part of the primary
 key after they've been set.


 On Monday, April 21, 2014, Andreas Wagner 
 andreas.josef.wag...@googlemail.com wrote:

 Hi cassandra users, hi Sebastian,

 I'd be interested in this ... is there any update/solution?

 Thanks so much ;)
 Andreas

 On 04/16/2014 11:43 AM, Sebastian Schmidt wrote:

 Hi,

 I'm using a Cassandra table to store some data. I created the table like
 this:
 CREATE TABLE IF NOT EXISTS table_name (s BLOB, p BLOB, o BLOB, c BLOB,
 PRIMARY KEY (s, p, o, c));

 I need at least the p column to be sorted, so that I can use it in a
 WHERE clause. So as far as I understand, the s column is now the row
 key, and (p, o, c) is the column name.

 I tried to delete single entries with a prepared statement like this:
 DELETE p, o, c FROM table_name WHERE s = ? AND p = ? AND o = ? AND c =
 ?;

 That didn't work, because p is a primary key part. It failed during
 preparation.

 I also tried to use variables like this:
 DELETE ?, ?, ? FROM table_name WHERE s = ?;

 This also failed during preparation, because ? is an unknown identifier.


 Since I have multiple different p, o, c combinations per s, deleting the
 whole row identified by s is not an option. So how can I delete an s, p, o, c
 tuple without deleting other s, p, o, c tuples with the same s? I know
 that this worked with Thrift/Hector before.

 Regards,
 Sebastian





 --
 Steve Robenalt
 Software Architect
 HighWire | Stanford University
 425 Broadway St, Redwood City, CA 94063

 srobe...@stanford.edu
 http://highwire.stanford.edu








Re: Deleting column names

2014-04-22 Thread Laing, Michael
Your understanding is incorrect - the easiest way to see that is to try it.


On Tue, Apr 22, 2014 at 12:00 PM, Sebastian Schmidt isib...@gmail.com wrote:

 From my understanding, this would delete all entries with the given s.
 Meaning, if I have inserted (sa, p1, o1, c1) and (sa, p2, o2, c2),
 executing this:

 DELETE FROM table_name WHERE s = sa AND p = p1 AND o = o1 AND c = c1

 would delete both (sa, p1, o1, c1) and (sa, p2, o2, c2). Is this correct? Or
 does the above statement only delete (sa, p1, o1, c1)?


 2014-04-22 4:00 GMT+02:00 Steven A Robenalt srobe...@stanford.edu:

 Is there a reason you can't use:

 DELETE FROM table_name WHERE s = ? AND p = ? AND o = ? AND c = ?;


 On Mon, Apr 21, 2014 at 6:51 PM, Eric Plowe eric.pl...@gmail.com wrote:

 Also I don't think you can null out columns that are part of the primary
 key after they've been set.


 On Monday, April 21, 2014, Andreas Wagner 
 andreas.josef.wag...@googlemail.com wrote:

 Hi cassandra users, hi Sebastian,

 I'd be interested in this ... is there any update/solution?

 Thanks so much ;)
 Andreas

 On 04/16/2014 11:43 AM, Sebastian Schmidt wrote:

 Hi,

 I'm using a Cassandra table to store some data. I created the table
 like
 this:
 CREATE TABLE IF NOT EXISTS table_name (s BLOB, p BLOB, o BLOB, c BLOB,
 PRIMARY KEY (s, p, o, c));

 I need at least the p column to be sorted, so that I can use it in a
 WHERE clause. So as far as I understand, the s column is now the row
 key, and (p, o, c) is the column name.

 I tried to delete single entries with a prepared statement like this:
 DELETE p, o, c FROM table_name WHERE s = ? AND p = ? AND o = ? AND c =
 ?;

 That didn't work, because p is a primary key part. It failed during
 preparation.

 I also tried to use variables like this:
 DELETE ?, ?, ? FROM table_name WHERE s = ?;

 This also failed during preparation, because ? is an unknown
 identifier.


 Since I have multiple different p, o, c combinations per s, deleting the
 whole row identified by s is not an option. So how can I delete an s, p, o, c
 tuple without deleting other s, p, o, c tuples with the same s? I know
 that this worked with Thrift/Hector before.

 Regards,
 Sebastian





 --
 Steve Robenalt
 Software Architect
  HighWire | Stanford University
 425 Broadway St, Redwood City, CA 94063

 srobe...@stanford.edu
 http://highwire.stanford.edu









Re: fixed size collection possible?

2014-04-22 Thread Chris Lohfink
It isn't natively supported, but there are some things you can do if you need it.

A lot depends on how frequently this list is getting updated. For heavier
workloads I would recommend using a custom CF for this instead of collections.
For extreme insert rates you would want to add additional partitioning to it
as well. As mentioned below, I'd recommend having a cleanup MR job to
periodically clean it up if the cost of TTLs possibly leading to 0 entries is
too expensive. Putting it in its own CF helps in that it keeps the elements of
the list from polluting your users partition. If there get to be a lot of
tombstones/inserts, this could make reading the user bad (it would look like a
queue, which has horrible performance), so it will at least section off that
badness from the regular user lookups.

CREATE TABLE user_top_places (
  user_id varchar,
  created timeuuid,
  place varchar,
  PRIMARY KEY (user_id, created))
  WITH CLUSTERING ORDER BY (created DESC);

then to add a new one to the front of the "list":

INSERT INTO user_top_places (user_id, created, place)
  VALUES ('frodo', now(), 'mordor');

and you can see the last 10 entries:

SELECT * FROM user_top_places WHERE user_id = 'frodo' LIMIT 10;

This will give you the last 10 entries (it allows duplicates though). Older
records will still be around, and disk space could eventually become a
problem for you. If it becomes bad I would recommend using a periodic job
like Hadoop to remove excess columns (solely to save disk space). Although
if you can afford the disk, it would give better performance to just let it
grow to a point (provided rows don't get too large, i.e. 64MB). If this
isn't very write-heavy there might be some more clever things you can do...

If not having duplicates is more important, then you can set "place" as your
column name:

CREATE TABLE user_top_places (user_id varchar, place varchar,
  created timestamp, PRIMARY KEY (user_id, place));

INSERT INTO user_top_places (user_id, place, created)
  VALUES ('frodo', 'mordor', dateof(now()));

but the results won't be in order of latest inserted, so you might have to do
some client-side filtering to show only the latest, using the created field.
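The periodic cleanup suggested above can be sketched client-side (an illustrative Python model, not an actual driver or Hadoop job; `CAP` and the table stand-in are hypothetical): read a user's row newest-first, keep the newest N entries, and delete the rest.

```python
from collections import defaultdict

CAP = 10  # keep at most this many places per user

# In-memory stand-in for the user_top_places table:
# user_id -> list of (created, place), newest first (DESC clustering).
rows = defaultdict(list)

def add_place(user_id, created, place):
    rows[user_id].insert(0, (created, place))

def trim(user_id):
    """What a periodic cleanup job would do: issue a DELETE for
    everything past the newest CAP entries."""
    excess = rows[user_id][CAP:]
    rows[user_id] = rows[user_id][:CAP]
    return excess  # in Cassandra, each of these becomes a DELETE

for t in range(15):
    add_place("frodo", t, "place-%d" % t)

deleted = trim("frodo")
assert len(rows["frodo"]) == CAP  # capped at the newest 10
assert len(deleted) == 5          # 5 oldest entries removed
```

The trade-off Chris notes still applies: if disk is cheap, running this job rarely (or not at all, up to a row-size limit) avoids the tombstone churn of aggressive trimming.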

---
Chris Lohfink

On Apr 22, 2014, at 1:51 AM, Jimmy Lin y2klyf+w...@gmail.com wrote:

 hi,
 looking at the collection type support in CQL3, e.g.
 http://www.datastax.com/documentation/cql/3.0/cql/cql_using/use_list_t.html
 
 we can append or remove elements using the + and - operators:
 UPDATE users
   SET top_places = top_places + [ 'mordor' ] WHERE user_id = 'frodo';
 UPDATE users
   SET top_places = top_places - ['riddermark'] WHERE user_id = 'frodo';
 
 Is there a way to keep the list (collection) at a fixed size?
 I was thinking about using TTL to remove older data after a certain time, but
 then the list will become too big if the TTL is too long, and if the TTL is
 too short I run the risk of having an empty list (if there is no new activity).
 
 Even if I don't use a collection type and have my own table, I still run into
 the same issue.
 
 Any recommendation to handle this type of situation?
 
 thanks
 



Re: Doubt

2014-04-22 Thread Chris Lohfink
Generally I've seen it recommended to do a composite CF, since it gives you
more flexibility and it's easier to debug. You can get some performance
improvements by storing a serialized blob (a lot of data can be represented
much smaller this way, by a factor of 10 or more if you're clever) to
represent your entity, but the complexity is rarely worth it. It is likely a
premature optimization, but I have seen cases where it showed a good
improvement.

In either case, the data will ultimately be read sequentially from disk per
sstable (the normal bottleneck), so the only benefits you gain are:
- potentially disk space (if serialization is efficient) and network bandwidth
- Cassandra won't have to deserialize as many columns, but I'm fairly certain
this is utterly irrelevant
- if stored in a format that you can deserialize efficiently (like protobufs),
it can make a big difference on your app side

Keep in mind, if serializing data, that you will always have to maintain code
that can read old versions; it can become very complex and lead to weird bugs.
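One common way to tame that versioning problem is to prefix the blob with a format version. A minimal sketch (illustrative Python using JSON in place of protobufs; the version scheme and names are assumptions, not anything from this thread):

```python
import json

def serialize(entity, version=2):
    # Prefix the blob with its format version so old blobs stay readable.
    return ("%d|" % version) + json.dumps(entity)

def deserialize(blob):
    version, _, payload = blob.partition("|")
    data = json.loads(payload)
    if int(version) == 1:
        # Hypothetical v1 format stored a bare list of values;
        # upgrade it to the v2 dict shape on read.
        data = {"values": data}
    return data

# An old v1 blob and a freshly written v2 blob decode to the same shape.
old_blob = "1|" + json.dumps([10, 20])
new_blob = serialize({"values": [10, 20]})
assert deserialize(old_blob) == deserialize(new_blob)
```

The point is that the upgrade logic lives in exactly one place (`deserialize`), rather than being scattered through every reader.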

---
Chris Lohfink

On Apr 21, 2014, at 3:53 AM, Jagan Ranganathan ja...@zohocorp.com wrote:

 Dear All,
 
 We have a requirement to store 'N' columns of an entity in a CF. Mostly this
 is write once and read many times. What is the best way to store the data?
 1. Composite CF
 2. Simple CF with the value as protobuf-serialized data
 Both provide extendable columns, which is a requirement for our usage.
 
 But I want to know which one is more efficient, assuming there are bound to
 be, say, 5% updates?
 
 Regards,
 Jagan



which replica has your data?

2014-04-22 Thread Han,Meng

Hi all,

I have a data item whose row key is
7573657238353137303937323637363334393636363230
and I have a five-node Cassandra cluster with the replication factor set to
3. Each node's token is listed below:


TOK: 0
TOK: 34028236692093846346337460743176821145
TOK: 68056473384187692692674921486353642291
TOK: 68056473384187692692674921486353642291
TOK: 102084710076281539039012382229530463436
TOK: 136112946768375385385349842972707284582

All five nodes are on the same rack and I am using SimpleSnitch.
Could someone tell me how I can find out which replica has/stores that
particular row above?


Is there any client-side command that can query which replica has a
certain row?


Thanks loads!


Cheers,
Meng


Re: which replica has your data?

2014-04-22 Thread Russell Bradberry
nodetool getendpoints <keyspace> <cf> <key>



On April 22, 2014 at 4:52:08 PM, Han,Meng (meng...@ufl.edu) wrote:

Hi all,  

I have a data item whose row key is  
7573657238353137303937323637363334393636363230  
and I have a five node Cassandra cluster with replication factor set to  
3. Each replica's token is listed below  

TOK: 0  
TOK: 34028236692093846346337460743176821145  
TOK: 68056473384187692692674921486353642291  
TOK: 68056473384187692692674921486353642291  
TOK: 102084710076281539039012382229530463436  
TOK: 136112946768375385385349842972707284582  

All five nodes are on the same rack and I am using SimpleSnitch.
Could someone tell me how I can find out which replica has/stores that
particular row above?

Is there any client side command that can query which replica has a  
certain row?  

Thanks loads!


Cheers,  
Meng  


Re: which replica has your data?

2014-04-22 Thread Robert Coli
On Tue, Apr 22, 2014 at 1:55 PM, Russell Bradberry rbradbe...@gmail.com wrote:

 nodetool getendpoints keyspace cf key


That will tell the OP which nodes *should* have the row. To answer which of
those replicas *actually have* the row, call the JMX
method getSSTablesForKey on each node returned by getendpoints. If there
is at least one SSTable listed, the node *has* the row.

The code doesn't seem to differentiate between a tombstone and an otherwise
masked value, FWIW, so your client might not see a row that getSSTablesForKey
says is in the files.

=Rob
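For reference, the SimpleStrategy placement that getendpoints reports can be modeled in a short sketch (this is an illustration, not Cassandra's actual code; one token in the original post appears twice, presumably a paste duplicate, so the sketch uses the five distinct values):

```python
import hashlib
from bisect import bisect_left

# Distinct tokens from the original post (RandomPartitioner, range 0..2**127).
ring = sorted([
    0,
    34028236692093846346337460743176821145,
    68056473384187692692674921486353642291,
    102084710076281539039012382229530463436,
    136112946768375385385349842972707284582,
])

def token_for(key_hex):
    # RandomPartitioner derives the token from the MD5 hash of the raw key,
    # taken as a non-negative big integer.
    digest = hashlib.md5(bytes.fromhex(key_hex)).digest()
    return abs(int.from_bytes(digest, "big", signed=True)) % (2 ** 127)

def replicas(key_hex, rf=3):
    # SimpleStrategy: the first node whose token >= the key's token
    # (wrapping around the ring), then the next rf-1 nodes clockwise.
    i = bisect_left(ring, token_for(key_hex)) % len(ring)
    return [ring[(i + k) % len(ring)] for k in range(rf)]

owners = replicas("7573657238353137303937323637363334393636363230")
assert len(owners) == 3                       # RF=3 distinct replicas
assert all(tok in ring for tok in owners)     # all drawn from the ring
```

This only models *placement*, which is exactly the caveat above: it says where the row should live, not whether an SSTable on that node actually contains it.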