fixed size collection possible?
hi,

look at the collection type support in CQL3, e.g. http://www.datastax.com/documentation/cql/3.0/cql/cql_using/use_list_t.html

We can append or remove using the + and - operators:

UPDATE users SET top_places = top_places + [ 'mordor' ] WHERE user_id = 'frodo';
UPDATE users SET top_places = top_places - ['riddermark'] WHERE user_id = 'frodo';

Is there a way to keep a fixed size for the list (collection)? I was thinking about using TTL to remove older data after a certain time, but then the list will become too big if the TTL is too long, and if the TTL is too short I run the risk of having an empty list (if there is no new activity).

Even if I don't use a collection type and have my own table, I still run into the same issue. Any recommendation for handling this type of situation?

thanks
Re: Deleting column names
Referring to the original post, I think the confusion is over what a row is in this context:

So as far as I understand, the s column is now the *row* key ... Since I have multiple different p, o, c combinations per s, deleting the whole *row* identified by s is no option

The s column is in fact the *partition key*, not the row key; the row key is the composite of all 4 columns (the partition key plus the clustering columns). Deleting the row, as Steven correctly showed, will not delete the partition, but only the row - the tuple of the 4 columns. Terminology has changed with CQL and we all have to get used to it.

ml

On Mon, Apr 21, 2014 at 10:00 PM, Steven A Robenalt srobe...@stanford.edu wrote:

Is there a reason you can't use:

DELETE FROM table_name WHERE s = ? AND p = ? AND o = ? AND c = ?;

On Mon, Apr 21, 2014 at 6:51 PM, Eric Plowe eric.pl...@gmail.com wrote:

Also I don't think you can null out columns that are part of the primary key after they've been set.

On Monday, April 21, 2014, Andreas Wagner andreas.josef.wag...@googlemail.com wrote:

Hi cassandra users, hi Sebastian,

I'd be interested in this ... is there any update/solution? Thanks so much ;)

Andreas

On 04/16/2014 11:43 AM, Sebastian Schmidt wrote:

Hi, I'm using a Cassandra table to store some data. I created the table like this:

CREATE TABLE IF NOT EXISTS table_name (s BLOB, p BLOB, o BLOB, c BLOB, PRIMARY KEY (s, p, o, c));

I need at least the p column to be sorted, so that I can use it in a WHERE clause. So as far as I understand, the s column is now the row key, and (p, o, c) is the column name. I tried to delete single entries with a prepared statement like this:

DELETE p, o, c FROM table_name WHERE s = ? AND p = ? AND o = ? AND c = ?;

That didn't work, because p is a primary key part. It failed during preparation. I also tried to use variables like this:

DELETE ?, ?, ? FROM table_name WHERE s = ?;

This also failed during preparation, because ? is an unknown identifier. Since I have multiple different p, o, c combinations per s, deleting the whole row identified by s is no option. So how can I delete an s, p, o, c tuple without deleting other s, p, o, c tuples with the same s? I know that this worked with Thrift/Hector before.

Regards, Sebastian

--
Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063
srobe...@stanford.edu
http://highwire.stanford.edu
Does NetworkTopologyStrategy in Cassandra 2.0 work?
Hi,

is it possible that NetworkTopologyStrategy does not work with Cassandra 2.0 any more? I just updated my dev cluster to 2.0.7 and got UnavailableExceptions for CQL/Thrift queries on my already existing column families, even though all (two) nodes were up. Changing to SimpleStrategy fixed the issue.

Also, I cannot switch back to NetworkTopologyStrategy:

[default@unknown] update keyspace MYKS with placement_strategy = 'NetworkTopologyStrategy';
Error constructing replication strategy class
[default@unknown] update keyspace MYKS with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy';
Error constructing replication strategy class

This does not seem to be something I encountered with 1.2 before. Can anyone tell me which one is broken here, Cassandra or myself? :-)

cheers, Christian
Re: Does NetworkTopologyStrategy in Cassandra 2.0 work?
Ok, it seems 2.0 now is simply stricter about datacenter names. I simply had to change the datacenter name to match the name in nodetool ring:

update keyspace MYKS with placement_strategy = 'NetworkTopologyStrategy' and strategy_options = {datacenter1 : 2};

So the schema was wrong, but 1.2 did not care about it.

cheers, Christian
BulkOutputFormat and CQL3
Hi Cassandra Users,

I have a Hadoop job that uses the pattern in Cassandra 2.0.6's hadoop_cql3_word_count example to load data from HDFS into Cassandra. Having read about BulkOutputFormat as a way to potentially significantly increase the write throughput from Hadoop to Cassandra, I am considering testing against that pattern (http://www.datastax.com/dev/blog/improved-hadoop-output, http://shareitexploreit.blogspot.com/2012/03/bulkloadto-cassandra-with-hadoop.html).

Is it possible/supported/recommended to use BulkOutputFormat to load data from Hadoop into a CQL3 table in Cassandra? I see several examples of building composite keys using Hector (e.g. http://www.datastax.com/dev/blog/introduction-to-composite-columns-part-1, http://brianoneill.blogspot.com/2012/09/composite-keys-connecting-dots-between.html), but with the changes to support CQL3 having left a lot of different documentation out there for different versions, it's not clear to me what the proper way is to build the requisite ByteBuffer, List<Mutation> pairs that ColumnFamilyOutputFormat (and so BulkOutputFormat) needs.

James
Re: Deleting column names
From my understanding, this would delete all entries with the given s. Meaning, if I have inserted (sa, p1, o1, c1) and (sa, p2, o2, c2), executing this:

DELETE FROM table_name WHERE s = sa AND p = p1 AND o = o1 AND c = c1

would delete sa, p1, o1, c1, p2, o2, c2. Is this correct? Or does the above statement only delete p1, o1, c1?

2014-04-22 4:00 GMT+02:00 Steven A Robenalt srobe...@stanford.edu:

Is there a reason you can't use:

DELETE FROM table_name WHERE s = ? AND p = ? AND o = ? AND c = ?;
Re: Deleting column names
Your understanding is incorrect - the easiest way to see that is to try it.

On Tue, Apr 22, 2014 at 12:00 PM, Sebastian Schmidt isib...@gmail.com wrote:

From my understanding, this would delete all entries with the given s. Meaning, if I have inserted (sa, p1, o1, c1) and (sa, p2, o2, c2), executing this: DELETE FROM table_name WHERE s = sa AND p = p1 AND o = o1 AND c = c1 would delete sa, p1, o1, c1, p2, o2, c2. Is this correct? Or does the above statement only delete p1, o1, c1?
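To see why only the single row goes away, it helps to picture the partition as a map from clustering columns to rows. A toy model in Python (a sketch of the semantics, not of Cassandra's storage engine):

```python
# Toy model of a CQL partition: the partition key s maps to a dict of
# rows keyed by the clustering columns (p, o, c). Deleting one full
# primary key removes a single row, not the whole partition.
table = {}

def insert(s, p, o, c):
    table.setdefault(s, {})[(p, o, c)] = True

def delete(s, p, o, c):
    # Analogous to: DELETE FROM table_name WHERE s=? AND p=? AND o=? AND c=?
    table.get(s, {}).pop((p, o, c), None)

insert('sa', 'p1', 'o1', 'c1')
insert('sa', 'p2', 'o2', 'c2')
delete('sa', 'p1', 'o1', 'c1')

print(sorted(table['sa']))  # the (p2, o2, c2) row survives
```

Deleting with the full primary key in the WHERE clause only removes the one entry keyed by (p1, o1, c1); the rest of the partition under s stays put.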
Re: fixed size collection possible?
It isn't natively supported, but there are some things you can do if you need it. A lot depends on how frequently this list is getting updated. For heavier workloads I would recommend using a custom CF for this instead of collections. For extreme insert rates you would want to add additional partitioning to it as well. As mentioned below, I'd recommend having a cleanup MR job to periodically clean it up if the cost of TTLs possibly leading to 0 entries is too expensive.

Putting it in its own CF helps in that it keeps the elements of the list from polluting your users partition. If there get to be a lot of tombstones/inserts, this could make reading the user bad (it would look like a queue, which has horrible performance), so it will at least section off that badness from the regular user lookups.

CREATE TABLE user_top_places (
  user_id varchar,
  created timeuuid,
  place varchar,
  PRIMARY KEY (user_id, created)
) WITH CLUSTERING ORDER BY (created DESC);

Then to add a new one to the front of the "list":

INSERT INTO user_top_places (user_id, created, place) VALUES ('frodo', now(), 'mordor');

And you can see the last 10 entries:

SELECT * FROM user_top_places WHERE user_id = 'frodo' LIMIT 10;

This will give you the last 10 entries (it allows duplicates though). Older records will still be around, though, and disk space could eventually become a problem for you. If it becomes bad I would recommend using a periodic job like Hadoop to remove excess columns (solely to save disk space). Although if you can afford the disk, it would give better performance to just let it grow to a point (provided rows don't get too large, i.e. 64mb). If this isn't very high in writes there might be some more clever things you can do...

If not having duplicates is more important, then you can use "place" as your column name:

CREATE TABLE user_top_places (
  user_id varchar,
  place varchar,
  created timestamp,
  PRIMARY KEY (user_id, place)
);

INSERT INTO user_top_places (user_id, place, created) VALUES ('frodo', 'mordor', dateof(now()));

But the results won't be in order of latest inserted, so you might have to do some client-side filtering to show only the latest, using the created field.

---
Chris Lohfink
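If you do go the periodic-cleanup route, the trimming logic itself is simple. A client-side sketch in Python (function name hypothetical; in practice the input would come from a SELECT against the table above and the returned rows would be fed into DELETE statements):

```python
def trim_to_newest(entries, limit):
    """Given (created, place) pairs, return the rows to delete so that
    only the `limit` newest remain. `created` must be comparable
    (e.g. a timestamp or timeuuid converted to a time); newest = largest."""
    newest_first = sorted(entries, key=lambda e: e[0], reverse=True)
    return newest_first[limit:]

rows = [(1, 'shire'), (5, 'mordor'), (3, 'bree'), (2, 'rivendell')]
to_delete = trim_to_newest(rows, 2)
print(to_delete)  # [(2, 'rivendell'), (1, 'shire')]
```

Running this periodically (rather than on every write) keeps the read path cheap while bounding the disk space the old entries consume.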
Re: Doubt
Generally I've seen it recommended to do a composite CF, since it gives you more flexibility and it's easier to debug. You can get some performance improvements by storing a serialized blob (a lot of data you can represent much smaller this way, by a factor of 10 or more if clever) to represent your entity, but the complexity is rarely worth it. It is likely a premature optimization, but I have seen cases where it's shown a good improvement. In either case, the data will ultimately be read sequentially from disk per sstable (the normal bottleneck), so the only benefits you gain are:

- potentially disk space (if serialization is efficient) and network bandwidth
- Cassandra won't have to deserialize as many columns, but I'm fairly certain this is utterly irrelevant
- if stored in a mechanism that you can deserialize efficiently (like protobufs), it can make a big difference on your app side

Keep in mind that if you serialize data, you will always have to maintain code that can read old versions; it can become very complex and lead to weird bugs.

---
Chris Lohfink

On Apr 21, 2014, at 3:53 AM, Jagan Ranganathan ja...@zohocorp.com wrote:

Dear All,

We have a requirement to store 'N' columns of an entity in a CF. Mostly this is write once and read many times. What is the best way to store the data?

- Composite CF
- Simple CF with value as protobuf extracted data

Both provide extendable columns, which is a requirement for our usage. But I want to know which one is efficient, assuming there is bound to be, say, 5% of updates?

Regards, Jagan
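The point about reading old versions is worth illustrating. A minimal sketch (the format and field names here are made up for illustration, not from the thread): tag each blob with a version byte so readers can dispatch to the right decoder after the schema evolves.

```python
import json
import struct

# Hypothetical versioned blob: a 1-byte version tag followed by a JSON
# payload. Readers dispatch on the tag so old rows stay readable after
# the schema evolves (v1 stored a single 'name'; v2 split it in two).
def encode_v2(first, last):
    payload = json.dumps({'first': first, 'last': last}).encode()
    return struct.pack('B', 2) + payload

def decode(blob):
    version, payload = blob[0], json.loads(blob[1:].decode())
    if version == 1:
        first, _, last = payload['name'].partition(' ')
        return first, last
    if version == 2:
        return payload['first'], payload['last']
    raise ValueError('unknown version %d' % version)

# A v1 row written before the schema change still decodes correctly:
old_blob = struct.pack('B', 1) + json.dumps({'name': 'Jagan R'}).encode()
print(decode(old_blob))                 # ('Jagan', 'R')
print(decode(encode_v2('Jagan', 'R')))  # ('Jagan', 'R')
```

Every decoder branch you add here is code you must keep forever (or until you rewrite all old rows), which is exactly the maintenance burden the reply warns about.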
which replica has your data?
Hi all,

I have a data item whose row key is 7573657238353137303937323637363334393636363230, and I have a five-node Cassandra cluster with the replication factor set to 3. Each replica's token is listed below:

TOK: 0
TOK: 34028236692093846346337460743176821145
TOK: 68056473384187692692674921486353642291
TOK: 68056473384187692692674921486353642291
TOK: 102084710076281539039012382229530463436
TOK: 136112946768375385385349842972707284582

All five nodes are on the same rack and I am using SimpleSnitch. Could someone tell me how I can find out which replica has/stores that particular row? Is there any client-side command that can query which replica has a certain row?

Thanks loads!

Cheers, Meng
Re: which replica has your data?
nodetool getendpoints keyspace cf key
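getendpoints is just asking Cassandra to run its replica-placement function for you. Under SimpleStrategy the logic is roughly the following sketch (simplified: single token per node, no vnodes, and the key must first be hashed to a token by the partitioner):

```python
from bisect import bisect_left

def simple_strategy_endpoints(ring, key_token, rf):
    """ring: list of (token, node) sorted by token. SimpleStrategy puts
    the first replica on the node owning the range the key's token falls
    in, then walks clockwise taking the next rf-1 distinct nodes."""
    tokens = [t for t, _ in ring]
    # A node owns the range (previous_token, its_token], so the owner is
    # the first node whose token is >= the key token (wrapping at the end).
    start = bisect_left(tokens, key_token) % len(ring)
    replicas = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == rf:
            break
    return replicas

# Small illustrative ring (tokens abbreviated, node names hypothetical):
ring = [(0, 'A'), (34, 'B'), (68, 'C'), (102, 'D'), (136, 'E')]
print(simple_strategy_endpoints(ring, 40, 3))  # ['C', 'D', 'E']
```

With RF=3 on a five-node ring, the key's token picks one primary owner and the two nodes clockwise from it; that is the set getendpoints prints.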
Re: which replica has your data?
On Tue, Apr 22, 2014 at 1:55 PM, Russell Bradberry rbradbe...@gmail.com wrote:

nodetool getendpoints keyspace cf key

That will tell OP what nodes *should* have the row. To answer which of those replicas *actually have* the row, call the JMX method getSSTablesForKey on each node returned by getendpoints. If there is at least one SSTable listed, the node *has* the row. The code doesn't seem to differentiate between a tombstone or other masked value, FWIW, so your client might not see a row that getSSTablesForKey says is in the files.

=Rob