Gradle script to execute CQL3 scripts
I have a requirement to execute CQL3 scripts through Gradle. Is there a Cassandra plugin for Gradle to do this, or is there any other way I can execute CQL3 scripts during the build itself? Please suggest.

Dawood
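I'm not aware of a dedicated plugin, but a plain Exec task that shells out to cqlsh is a minimal sketch of one way to do it. Everything here is an assumption to adapt: that cqlsh is on the PATH of the build machine, that your scripts live under src/main/cql, and the hypothetical task/file names.

```groovy
// build.gradle -- minimal sketch, not a plugin; assumes cqlsh is on PATH
// and a hypothetical src/main/cql/schema.cql script.
task executeCql(type: Exec) {
    description = 'Runs a CQL3 script against a local Cassandra node via cqlsh'
    commandLine 'cqlsh', '-f', 'src/main/cql/schema.cql', 'localhost'
}

// Hypothetical wiring: load the schema before the test task runs.
// test.dependsOn executeCql
```

`gradle executeCql` would then fail the build if cqlsh exits non-zero, which is usually what you want during CI.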
read ?
Hi all,

Quick question. I currently am looking at a 4 node cluster and I have currently stopped all writing to Cassandra, with the reads continuing. I'm trying to understand the utilization of memory within the JVM. nodetool info on each of the nodes shows them all growing in footprint, two of them at a greater rate. On the restart of Cassandra each was at about 100MB; after 2 days, they are at:

Heap Memory (MB) : 798.41 / 3052.00
Heap Memory (MB) : 370.44 / 3052.00
Heap Memory (MB) : 549.73 / 3052.00
Heap Memory (MB) : 481.89 / 3052.00

Ring configuration:

Address  Rack  Status  State   Load     Owns    Token
                                                127605887595351923798765477786913079296
x        1d    Up      Normal  4.38 GB  25.00%  0
x        1d    Up      Normal  4.17 GB  25.00%  42535295865117307932921825928971026432
x        1d    Up      Normal  4.19 GB  25.00%  85070591730234615865843651857942052864
x        1d    Up      Normal  4.14 GB  25.00%  127605887595351923798765477786913079296

What I'm not sure of is why the growth is different between each node, and why that growth is being created by activity that is read only. Is Cassandra caching and holding the read data? I currently have caching turned off for the key/row.

Also, as part of the info command:

Key Cache : size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN recent hit rate, 14400 save period in seconds
Row Cache : size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds

Thanks,
Jim
[RELEASE] Apache Cassandra 2.0 released
The Cassandra team is very pleased to announce the release of Apache Cassandra version 2.0.0. Cassandra 2.0.0 is a new major release that adds numerous improvements[1,2], including:

- Lightweight transactions[4] that offer linearizable consistency.
- Experimental triggers support[5].
- Numerous enhancements to CQL as well as a new and better version of the native protocol[6].
- Compaction improvements[7] (including a hybrid strategy that combines leveled and size-tiered compaction).
- A new, faster Thrift server implementation based on the LMAX Disruptor[8].
- Eager retries: avoids query timeouts by sending data requests to other replicas if too much time passes on the original request.

See the full changelog[1] for more, and please make sure to check the release notes[2] for upgrading details. Both source and binary distributions of Cassandra 2.0.0 can be downloaded at:

http://cassandra.apache.org/download/

As usual, a debian package is available from the project APT repository[3] (you will need to use the 20x series).

The Cassandra team

[1]: http://goo.gl/zU4sWv (CHANGES.txt)
[2]: http://goo.gl/MrR6Qn (NEWS.txt)
[3]: http://wiki.apache.org/cassandra/DebianPackaging
[4]: http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
[5]: http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-0-prototype-triggers-support
[6]: http://www.datastax.com/dev/blog/cql-in-cassandra-2-0
[7]: https://issues.apache.org/jira/browse/CASSANDRA-5371
[8]: https://issues.apache.org/jira/browse/CASSANDRA-5582
Re: CqlStorage creates wrong schema for Pig
You're trying to use FromCqlColumn on a tuple that has been flattened. The schema still thinks it's {title: chararray}, but the flattened tuple is now two values. I don't know how to retrieve the data values in this case. Your code will work correctly if you do this:

values3 = FOREACH rows GENERATE FromCqlColumn(title) AS title;
dump values3;
describe values3;

(Use FromCqlColumn on the original data, not the flattened data.)

Chad

On Mon, Sep 2, 2013 at 8:45 AM, Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com wrote:

Hi

1.- Maybe:

-- Register the UDF
REGISTER /path/to/cqlstorageudf-1.0-SNAPSHOT
-- FromCqlColumn will convert chararray, int, long, float, double
DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();
-- Load data as normal
data_raw = LOAD 'cql://bookcrossing/books' USING CqlStorage();
-- Use the UDF
data = FOREACH data_raw GENERATE
    FromCqlColumn(isbn) AS ISBN,
    FromCqlColumn(bookauthor) AS BookAuthor,
    FromCqlColumn(booktitle) AS BookTitle,
    FromCqlColumn(publisher) AS Publisher,
    FromCqlColumn(yearofpublication) AS YearOfPublication;

and 2.- with the data in cql cassandra 1.2.8, pig 0.11.11 and cql3:

CREATE KEYSPACE keyspace1
  WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }
  AND durable_writes = true;

use keyspace2;

CREATE TABLE test (
  id text PRIMARY KEY,
  title text,
  age int
) WITH COMPACT STORAGE;

insert into test (id, title, age) values('1', 'child', 21);
insert into test (id, title, age) values('2', 'support', 21);
insert into test (id, title, age) values('3', 'manager', 31);
insert into test (id, title, age) values('4', 'QA', 41);
insert into test (id, title, age) values('5', 'QA', 30);
insert into test (id, title, age) values('6', 'QA', 30);

and script:

register './libs/cqlstorageudf-1.0-SNAPSHOT.jar';
DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();
rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage();
dump rows;
ILLUSTRATE rows;
describe rows;
A = FOREACH rows GENERATE FLATTEN(title);
dump A;
values3 = FOREACH A GENERATE FromCqlColumn(title) AS title;
dump values3;
describe values3;

I have this error:

--------------------------------------------------------
| rows | id:chararray | age:int   | title:chararray    |
--------------------------------------------------------
|      | (id, 5)      | (age, 30) | (title, QA)        |
--------------------------------------------------------

rows: {id: chararray,age: int,title: chararray}
...
(title,QA)
(title,QA)
...

2013-09-02 16:40:52,454 [Thread-11] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0003
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple
    at com.megatome.pig.piggybank.tuple.ColumnBase.exec(ColumnBase.java:32)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
2013-09-02 16:40:52,832 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0003

Regards ...

Miguel Angel Martín Junquera
Analyst Engineer. miguelangel.mar...@brainsins.com

2013/9/2 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com

hi all:

More info: https://issues.apache.org/jira/browse/CASSANDRA-5941

I tried this (and gen. cassandra 1.2.9) but it does not work for me:

git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
cd cassandra
git checkout cassandra-1.2
patch -p1
Re: row cache
On 09/01/2013 03:06 PM, Faraaz Sareshwala wrote: Yes, that is correct. The SerializingCacheProvider stores row cache contents off heap. I believe you need JNA enabled for this though. Someone please correct me if I am wrong here. The ConcurrentLinkedHashCacheProvider stores row cache contents on the java heap itself.

Naming things is hard. Both caches are in memory and are backed by a ConcurrentLinkedHashMap. In the case of the SerializingCacheProvider, the *values* are stored in off-heap buffers. Both must store a half dozen or so objects (on heap) per entry (org.apache.cassandra.cache.RowCacheKey, com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue, java.util.concurrent.ConcurrentHashMap$HashEntry, etc.). It would probably be better to call this a mixed-heap rather than an off-heap cache. You may find the number of entries you can hold without GC problems to be surprisingly low (relative to, say, memcached, or physical memory on modern hardware).

Invalidating a column with SerializingCacheProvider invalidates the entire row, while with ConcurrentLinkedHashCacheProvider it does not. SerializingCacheProvider does not require JNA.

Both also use memory estimation of the size (of the values only) to determine the total number of entries retained. Estimating the size of the totally on-heap ConcurrentLinkedHashCacheProvider has historically been dicey since we switched from sizing in entries, and it has been removed in 2.0.0.

As said elsewhere in this thread, the utility of the row cache varies from absolutely essential to a source of numerous problems, depending on the specifics of the data model and request distribution.
RE: read ?
To get an accurate picture you should force a full GC on each node; the heap utilization can be misleading since there can be a lot of things in the heap with no strong references. There are a number of factors that can lead to this.

For a true comparison I would recommend using jconsole and calling dumpHeap on com.sun.management:type=HotSpotDiagnostic with the 2nd param true (force GC). Then open the heap dump up in a tool like YourKit and you will get a better comparison; it will also tell you what it is that's taking the space.

Chris

From: Langston, Jim [mailto:jim.langs...@compuware.com]
Sent: Tuesday, September 03, 2013 8:20 AM
To: user@cassandra.apache.org
Subject: read ?
Re: Recommended storage choice for Cassandra on Amazon m1.xlarge instance
You benefit from putting the commit log on a separate drive only if that drive is an isolated spinning device. EC2 ephemeral storage is a virtual device, so I don't think it makes sense to put the commit log on a separate drive. I would build raid0 from the 4 drives and put everything there. But it would be interesting to compare different configurations.

Thank you,
Andrey

On Mon, Sep 2, 2013 at 7:11 PM, Renat Gilfanov gren...@mail.ru wrote:

Hello,

I'd like to ask what are the best options for separating the commit log and data on an Amazon m1.xlarge instance, given 4x420 GB attached storages and an EBS volume? As far as I understand, EBS is not the choice and it's recommended to use attached storage instead. Is it better to combine the 4 ephemeral drives into 2 raid0 (or raid1?) arrays, and store data on the first and the commit log on the second? Or maybe try other combinations, like 1 attached storage for the commit log and the other 3 grouped in raid0 for data?

Thank you.
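For reference, Andrey's "raid0 from 4 drives" suggestion looks roughly like the following on a stock Linux AMI. This is a sketch only: the device names (/dev/xvdb through /dev/xvde), filesystem, and mount point are assumptions — check your instance with lsblk first, and note the array does not survive instance stop/start since ephemeral disks are wiped.

```
# Assemble the four m1.xlarge ephemeral disks into one raid0 array
# and mount it for both data and commit log (requires root).
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
mkfs.ext4 /dev/md0
mkdir -p /var/lib/cassandra
mount /dev/md0 /var/lib/cassandra
```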
Re: Versioning in cassandra
Jan,

The solution you gave works spot on, but there is one more requirement I forgot to mention. Following is my table structure:

CREATE TABLE file (
  id text,
  contenttype text,
  createdby text,
  createdtime timestamp,
  description text,
  name text,
  parentid text,
  version timestamp,
  PRIMARY KEY (id, version)
) WITH CLUSTERING ORDER BY (version DESC);

The query provided (select * from file where id = 'xxx' limit 1;) solves the problem of finding the latest version of a file. But I have one more requirement: finding all the latest-version files having parentid, say, 'yyy'. Please suggest how this query can be achieved.

Dawood

On Tue, Sep 3, 2013 at 12:43 AM, dawood abdullah muhammed.daw...@gmail.com wrote:

In my case version can be a timestamp as well. What do you suggest the version number to be? Do you see any problems if I keep version as a counter / timestamp?

On Tue, Sep 3, 2013 at 12:22 AM, Jan Algermissen jan.algermis...@nordsc.com wrote:

On 02.09.2013, at 20:44, dawood abdullah muhammed.daw...@gmail.com wrote:

Requirement is like I have a column family, say File:

create table file(id text primary key, fname text, version int, mimetype text, content text);

Say I have a few records inserted; when I modify an existing record (content is modified), a new version needs to be created, as I need to have provision to revert back to any old version whenever required.

So, can version be a timestamp? Or does it need to be an integer? In the former case, make use of C*'s ordering like so:

CREATE TABLE file (
  file_id text,
  version timestamp,
  fname text,
  PRIMARY KEY (file_id, version)
) WITH CLUSTERING ORDER BY (version DESC);

Get the latest file version with:

select * from file where file_id = 'xxx' limit 1;

If it has to be an integer, use counter columns.

Jan

Regards,
Dawood

On Mon, Sep 2, 2013 at 10:47 PM, Jan Algermissen jan.algermis...@nordsc.com wrote:

Hi Dawood,

On 02.09.2013, at 16:36, dawood abdullah muhammed.daw...@gmail.com wrote:

Hi

I have a requirement of versioning to be done in Cassandra. Following is my column family definition:

create table file_details(id text primary key, fname text, version int, mimetype text);

I have a secondary index created on the fname column. Whenever I do an insert for the same 'fname', the version should be incremented. And when I retrieve a row with fname it should return me the latest version row. Is there a better way to do this in Cassandra? Please suggest what approach needs to be taken.

Can you explain more about your use case? If the version need not be a small number, but could be a timestamp, you could make use of C*'s ordering feature, have the database set the new version as a timestamp, and retrieve the latest one with a simple LIMIT 1 query. (I'll explain more when this is an option for you.)

Jan

P.S. Me being a REST/HTTP head, an alarm rings when I see 'version' next to 'mimetype' :-) What exactly are you versioning here? Maybe we can even change the situation from a functional POV?

Regards,
Dawood
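To make the timestamp-clustering pattern from this thread concrete, here is a small end-to-end sketch; table name matches Jan's example, but the sample timestamps and filename are illustrative only:

```cql
CREATE TABLE file (
  file_id text,
  version timestamp,
  fname text,
  PRIMARY KEY (file_id, version)
) WITH CLUSTERING ORDER BY (version DESC);

-- each save is a plain INSERT with the write time as the version
INSERT INTO file (file_id, version, fname) VALUES ('xxx', '2013-09-01 10:00:00', 'a.txt');
INSERT INTO file (file_id, version, fname) VALUES ('xxx', '2013-09-02 09:30:00', 'a.txt');

-- latest version: DESC clustering means the first row is the newest
SELECT * FROM file WHERE file_id = 'xxx' LIMIT 1;

-- full history, e.g. to pick a version to revert to
SELECT * FROM file WHERE file_id = 'xxx';
```

The point of the design is that "create a new version" and "overwrite" are the same cheap write path; ordering is handled by the clustering key rather than by the application.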
RE: read ?
Does it actually OOM eventually? There will be a certain amount of object allocation for reads (or anything) which will see the heap creep up until a GC, but at ~500MB or so of an 8GB heap there is little reason for the JVM to do it, so it probably just ignores it to save processing. Even the young gen won't require a collection at this size.

Which version of Cassandra are you running? Prior to 1.2, a lot of metadata about the sstables took considerable heap, which could cause additional memory utilization.

Chris

From: Langston, Jim [mailto:jim.langs...@compuware.com]
Sent: Tuesday, September 03, 2013 11:33 AM
To: user@cassandra.apache.org
Subject: Re: read ?

Thanks Chris,

I have about 8 heap dumps that I have been looking at. I have been trying to isolate why I have been dumping heap; I've started by removing the apps that write to Cassandra and eliminating the work that would entail. I am left with just the apps that are reading the data, and from the heap dumps it looks like Cassandra Column methods are being called. Because there are so many objects, it is difficult to ascertain exactly what the problem may be. That prompted my query: trying to quickly determine if Cassandra holds objects that have been used for reading, and if so, why, and more importantly whether something can be done.

Jim
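Chris's dumpHeap advice can also be scripted rather than clicked through in jconsole. The sketch below invokes the same MBean operation on its own JVM via the platform MBeanServer; for a remote Cassandra node you would instead attach a JMXConnector to the node's JMX port (7199 by default) and invoke the same operation — treat the class name and dump location here as assumptions.

```java
import java.io.File;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ForceHeapDump {
    public static void main(String[] args) throws Exception {
        // dumpHeap refuses to overwrite an existing file, so reserve a
        // fresh path and delete the placeholder first.
        File dump = File.createTempFile("heap", ".hprof");
        dump.delete();

        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Same MBean/operation Chris refers to; the second ("live") argument
        // set to true forces a full GC so only strongly-referenced objects
        // end up in the dump.
        server.invoke(new ObjectName("com.sun.management:type=HotSpotDiagnostic"),
                      "dumpHeap",
                      new Object[] { dump.getAbsolutePath(), Boolean.TRUE },
                      new String[] { "java.lang.String", "boolean" });

        System.out.println("wrote " + (dump.length() / 1024) + " KB heap dump");
    }
}
```

The resulting .hprof file can then be opened in YourKit, MAT, or jhat for the comparison described above.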
Re: Upgrade from 1.0.9 to 1.2.8
Ah. I was going by the upgrade recommendations in the NEWS.txt file in the cassandra source tree, which didn't mention that version (1.0.11) whatsoever. I didn't see any show-stoppers that would have prevented me from going straight from 1.0.9 to 1.2.x.

https://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/cassandra-1.2.4

Looks like a multi-step upgrade is the way I'll be proceeding. Thanks for the insight, everyone.

MN

On 09/02/2013 11:04 AM, Jeremiah D Jordan wrote:

1.0.9 -> 1.0.12 -> 1.1.12 -> 1.2.x? Because of this fix in 1.0.11:

* fix 1.0.x node join to mixed version cluster, other nodes >= 1.1 (CASSANDRA-4195)

-Jeremiah

--
Mike Neir
Liquid Web, Inc.
Infrastructure Administrator
Re: Versioning in cassandra
Create a secondary index over parentid, OR make it part of the clustering key.

-Vivek
Re: Versioning in cassandra
My bad. I did miss reading the "latest version" part.

-Vivek

On Tue, Sep 3, 2013 at 11:20 PM, dawood abdullah muhammed.daw...@gmail.com wrote:

I have tried both options, creating a secondary index and also adding parentid to the primary key, but I am getting all the files with parentid 'yyy'. What I want is the latest version of each file for the combination of parentid and fileid. Say below are the records inserted in the file table:

insert into file (id, parentid, version, contenttype, description, name) values ('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 'file1');
insert into file (id, parentid, version, contenttype, description, name) values ('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1');
insert into file (id, parentid, version, contenttype, description, name) values ('f2', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1');
insert into file (id, parentid, version, contenttype, description, name) values ('f2', 'd1', '2011-03-06', 'pdf', 'f1 file', 'file1');

I want to write a query which returns me the second and last records and not the first and third, because for the first and third records there exists a later version for the combination of id and parentid. I am confused if at all this is achievable; please suggest.

Dawood
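One common way to serve the "latest version per file under a parent" query in Cassandra is denormalization: keep the versioned history table from this thread as-is, and have the application also write each save into a second, query-specific table keyed so each file gets exactly one row that is overwritten on every write. A sketch — the table and column names here are only illustrative, not from the thread:

```cql
-- one row per (parentid, id); every new version overwrites the old row
CREATE TABLE latest_file_by_parent (
  parentid text,
  id text,
  version timestamp,
  name text,
  contenttype text,
  PRIMARY KEY (parentid, id)
);

-- on every save, write to both the history table and this one
INSERT INTO latest_file_by_parent (parentid, id, version, name, contenttype)
VALUES ('d1', 'f1', '2011-03-05', 'file1', 'pdf');

-- all latest versions under a parent, in one partition read
SELECT * FROM latest_file_by_parent WHERE parentid = 'd1';
```

The cost is a second write per save (cheap in Cassandra) in exchange for reads that need no filtering or client-side deduplication.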
RE: map/reduce performance time and sstable reader…
I am trying to do the same thing; in our project we want to load data from Cassandra into a Hadoop cluster, and SSTables are one obvious option, as you can get the data changed since the last batch load directly from the SSTable incremental backup files. But based on my research so far (I may be wrong, as I have only done limited research about SSTables; I hope someone in this forum can tell me that I am wrong), it may NOT be a good option:

1) sstable2json looks like NOT a scalable solution to get data out of Cassandra, and it needs access to the data directory to get some metadata from the system keyspace for the column family being dumped, which may not be an option in your MR environment.

2) So far I am thinking of reusing the same API as used by sstable2json, but I have to provide the metadata in the API, like validator types/partitioner etc. I am surprised that, as a backup, the column family SSTable dump files DON'T contain this information themselves. Shouldn't it be possible to find this out from the SSTable files (ONLY)?

3) The big trouble comes if you want to parse the SSTables in your MR code. The API internally will load the Index/CompressionInfo data from the Index/Compression files, which it assumes are located in the same place as the data file, and it will use a file stream internally. So if these data files are in a DFS (Distributed File System), so far I haven't found a way to tell the API to use a stream from the DFS instead of a local file input stream. So basically you have 2 options: a) copy these files from the DFS to the local file system (same as what the Knewton guys did at https://github.com/Knewton/KassandraMRHelper), or b) develop your own API to access the SSTable files directly (my guess is that the Netflix guys probably did it this way; they have a project called Aegisthus — see here: http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html — but it is not open source).

4) About the performance, I am not sure, as sstable2json underneath uses the same Cassandra API, but running in MR gives us some support in scalability, as we can reuse the Hadoop framework for the many benefits it can bring.

Yong

From: dean.hil...@nrel.gov
To: user@cassandra.apache.org
Date: Fri, 30 Aug 2013 07:25:09 -0600
Subject: map/reduce performance time and sstable reader…

Has anyone done performance tests on sstable reading vs. M/R? I did a quick test reading all SSTables in an LCS column family across 23 tables and took the average time sstable2json took (to /dev/null to make it faster), which was 7 seconds per table (reading to stdout took 16 seconds per table). This then worked out to an estimate of 12.5 hours, up to 27 hours (from the to-stdout calculation). I suspect the map/reduce time may be much worse, since there are not as many repeated rows in LCS. I.e., I am wondering if I should just read from SSTables directly instead of using map/reduce? I am about to dig around in the code of M/R and sstable2json to see what each is doing specifically.

Thanks,
Dean
Re: Cassandra cluster migration in Amazon EC2
On Mon, Sep 2, 2013 at 4:21 PM, Renat Gilfanov gren...@mail.ru wrote:

- Group 3 of the storages into a raid0 array, move the data directory to the raid0, and the commit log to the 4th remaining storage. As far as I understand, separation of the commit log and data directory should improve performance, but what about separating the OS from those two — is it worth doing?

Nope. Best practice for Amazon is ephemeral disks, and RAID0 for data + commit log.

- What are the steps to perform such a migration? Will it be possible to perform it without downtime, restarting node by node with the new configuration applied? I'm especially worried about IP changes when we upgrade the instance type. What's the recommended way to handle those IP changes?

Just set auto_bootstrap: false in cassandra.yaml to change the IP address of a node to which you have copied all the data its token had before its IP address changed, and which therefore does not need to be bootstrapped.

=Rob
Re: Versioning in cassandra
I have tried with both the options creating secondary index and also tried adding parentid to primary key, but I am getting all the files with parentid 'yyy', what I want is the latest version of file with the combination of parentid, fileid. Say below are the records inserted in the file table: insert into file (id, parentid, version, contenttype, description, name) values ('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 'file1'); insert into file (id, parentid, version, contenttype, description, name) values ('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1'); insert into file (id, parentid, version, contenttype, description, name) values ('f2', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1'); insert into file (id, parentid, version, contenttype, description, name) values ('f2', 'd1', '2011-03-06', 'pdf', 'f1 file', 'file1'); I want to write a query which returns me second and last record and not the first and third record, because for the first and third record there exists a latest version, for the combination of id and parentid. I am confused If at all this is achievable, please suggest. Dawood On Tue, Sep 3, 2013 at 10:58 PM, Vivek Mishra mishra.v...@gmail.com wrote: create secondary index over parentid. OR make it part of clustering key -Vivek On Tue, Sep 3, 2013 at 10:42 PM, dawood abdullah muhammed.daw...@gmail.com wrote: Jan, The solution you gave works spot on, but there is one more requirement I forgot to mention. Following is my table structure CREATE TABLE file ( id text, contenttype text, createdby text, createdtime timestamp, description text, name text, parentid text, version timestamp, PRIMARY KEY (id, version) ) WITH CLUSTERING ORDER BY (version DESC); The query (select * from file where id = 'xxx' limit 1;) provided solves the problem of finding the latest version file. But I have one more requirement of finding all the latest version files having parentid say 'yyy'. Please suggest how can this query be achieved. 
Dawood On Tue, Sep 3, 2013 at 12:43 AM, dawood abdullah muhammed.daw...@gmail.com wrote: In my case the version can be a timestamp as well. What do you suggest the version number should be; do you see any problems if I keep the version as a counter / timestamp? On Tue, Sep 3, 2013 at 12:22 AM, Jan Algermissen jan.algermis...@nordsc.com wrote: On 02.09.2013, at 20:44, dawood abdullah muhammed.daw...@gmail.com wrote: The requirement is like this: I have a column family, say File: create table file(id text primary key, fname text, version int, mimetype text, content text); Say I have a few records inserted; when I modify an existing record (content is modified) a new version needs to be created, as I need to have provision to revert back to any old version whenever required. So, can the version be a timestamp? Or does it need to be an integer? In the former case, make use of C*'s ordering like so: CREATE TABLE file ( file_id text, version timestamp, fname text, PRIMARY KEY (file_id, version) ) WITH CLUSTERING ORDER BY (version DESC); Get the latest file version with select * from file where file_id = 'xxx' limit 1; If it has to be an integer, use counter columns. Jan Regards, Dawood On Mon, Sep 2, 2013 at 10:47 PM, Jan Algermissen jan.algermis...@nordsc.com wrote: Hi Dawood, On 02.09.2013, at 16:36, dawood abdullah muhammed.daw...@gmail.com wrote: Hi, I have a requirement for versioning in Cassandra. Following is my column family definition: create table file_details(id text primary key, fname text, version int, mimetype text); I have a secondary index created on the fname column. Whenever I do an insert for the same 'fname', the version should be incremented. And when I retrieve a row by fname it should return me the latest-version row. Is there a better way to do this in Cassandra? Please suggest what approach needs to be taken. Can you explain more about your use case? 
If the version need not be a small number, but could be a timestamp, you could make use of C*'s ordering feature, have the database set the new version as a timestamp, and retrieve the latest one with a simple LIMIT 1 query. (I'll explain more when this is an option for you.) Jan P.S. Me being a REST/HTTP head, an alarm rings when I see 'version' next to 'mimetype' :-) What exactly are you versioning here? Maybe we can even change the situation from a functional POV? Regards, Dawood
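A client-side sketch of the selection Dawood is after - keeping only the newest version per (parentid, id) pair - using the rows from the inserts quoted above. Plain tuples stand in for driver result rows; this is illustrative, not driver code:

```python
# Keep only the latest version per (parentid, id); rows mirror the four
# inserts in the thread. ISO date strings compare chronologically, so a
# plain string comparison works here.
rows = [
    ("f1", "d1", "2011-03-04"),
    ("f1", "d1", "2011-03-05"),
    ("f2", "d1", "2011-03-05"),
    ("f2", "d1", "2011-03-06"),
]

latest = {}
for file_id, parent_id, version in rows:
    key = (parent_id, file_id)
    if key not in latest or version > latest[key]:
        latest[key] = version

print(latest)  # {('d1', 'f1'): '2011-03-05', ('d1', 'f2'): '2011-03-06'}
```

This yields exactly the second and fourth records from the thread's example, i.e. the latest version per (parentid, id).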
Re: read ?
Thanks Chris, I have about 8 heap dumps that I have been looking at. I have been trying to isolate why I have been dumping heap; I've started by removing the apps that write to Cassandra, eliminating the work that would entail. I am left with just the apps that are reading the data, and from the heap dumps it looks like Cassandra Column methods are being called; because there are so many objects, it is difficult to ascertain exactly what the problem may be. That prompted my query, trying to quickly determine if Cassandra holds objects that have been used for reading, and if so, why, and more importantly if something can be done. Jim From: Lohfink, Chris chris.lohf...@digi.com Reply-To: user@cassandra.apache.org Date: Tue, 3 Sep 2013 11:12:19 -0500 To: user@cassandra.apache.org Subject: RE: read ? To get an accurate picture you should force a full GC on each node; the heap utilization can be misleading since there can be a lot of things in the heap with no strong references. There are a number of factors that can lead to this. For a true comparison I would recommend using jconsole and calling dumpHeap on com.sun.management:type=HotSpotDiagnostic with the 2nd param true (force GC). Then open the heap dump up in a tool like YourKit and you will get a better comparison, and it will also tell you what it is that's taking the space. Chris From: Langston, Jim [mailto:jim.langs...@compuware.com] Sent: Tuesday, September 03, 2013 8:20 AM To: user@cassandra.apache.org Subject: read ? Hi all, Quick question. I currently am looking at a 4 node cluster and I have currently stopped all writing to Cassandra, with the reads continuing. I'm trying to understand the utilization of memory within the JVM. nodetool info on each of the nodes shows them all growing in footprint, 2 of them at a greater rate. 
On the restart of Cassandra each was at about 100MB; after 2 days, each of them is at:
Heap Memory (MB) : 798.41 / 3052.00
Heap Memory (MB) : 370.44 / 3052.00
Heap Memory (MB) : 549.73 / 3052.00
Heap Memory (MB) : 481.89 / 3052.00
Ring configuration:
Address  Rack  Status  State   Load     Owns    Token
                                                127605887595351923798765477786913079296
x        1d    Up      Normal  4.38 GB  25.00%  0
x        1d    Up      Normal  4.17 GB  25.00%  42535295865117307932921825928971026432
x        1d    Up      Normal  4.19 GB  25.00%  85070591730234615865843651857942052864
x        1d    Up      Normal  4.14 GB  25.00%  127605887595351923798765477786913079296
What I'm not sure of is why the growth is different between each, and why that growth is being created by activity that is read only. Is Cassandra caching and holding the read data? I currently have caching turned off for the key/row. Also, as part of the info command:
Key Cache: size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN recent hit rate, 14400 save period in seconds
Row Cache: size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Thanks, Jim
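To line up the figures Jim pasted, the "Heap Memory" line that nodetool info prints can be parsed and compared per node - a small illustrative sketch, not a Cassandra tool:

```python
import re

# Heap figures as printed by `nodetool info` in the thread above.
lines = [
    "Heap Memory (MB) : 798.41 / 3052.00",
    "Heap Memory (MB) : 370.44 / 3052.00",
    "Heap Memory (MB) : 549.73 / 3052.00",
    "Heap Memory (MB) : 481.89 / 3052.00",
]

def heap_usage(line):
    """Return (used_mb, max_mb, fraction) from a nodetool info heap line."""
    used, total = (float(x) for x in re.search(r"([\d.]+) / ([\d.]+)", line).groups())
    return used, total, used / total

for line in lines:
    used, total, frac = heap_usage(line)
    print(f"{used:7.2f} / {total:.2f} MB ({frac:.0%})")
```

As Chris points out, these numbers include garbage with no strong references, so a full GC before comparing is what makes them meaningful.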
Re: map/reduce performance time and sstable reader….
We are considering creating our own InputFormat for Hadoop and running the tasktrackers on every 3rd node (ie. RF=3) such that we cover all ranges. Our M/R overhead appears to be 13 days vs. 12.5 hours for just reading SSTables directly on our current data set. I personally don't think parsing SSTables (using the Hadoop M/R framework) is a big deal for us, since we run task trackers on the Cassandra nodes we need them on. Ie. we don't need to copy to DFS to do this, I believe (at least not in our situation). I already wrote a client on the SSTableReader parsing out sstables to take a look at some of our data while our 13 day M/R job is running (we are 4 days in already with no failures and no performance degradation). later, Dean From: java8964 java8964 java8...@hotmail.com Reply-To: user@cassandra.apache.org Date: Tuesday, September 3, 2013 12:06 PM To: user@cassandra.apache.org Subject: RE: map/reduce performance time and sstable reader…. I am trying to do the same thing; in our project, we want to load the data from Cassandra into a Hadoop cluster, and SSTable is one obvious option, as you can get the changed data since the last batch load directly from the SSTable incremental backup files. But based on my research so far (I may be wrong, as I have done only limited research about SSTables; I hope someone in this forum can tell me that I am wrong), it may NOT be a good option: 1) sstable2json looks like NOT a scalable solution to get the data out of Cassandra, and it needs access to the data directory to get some metadata from the system keyspace for the column family data being dumped, which may not be an option in your MR environment. 
2) So far I am thinking of reusing the same API as used in sstable2json, but I have to provide the metadata in the API, like validator types/partitioner etc. I am surprised that, as a backup, the column family SSTable dump files DON'T contain this information themselves. Shouldn't it be possible to find this out from the SSTable files (ONLY)? 3) The big trouble comes if you want to parse the SSTables in your MR code. The API internally will load the Index/Compression_Info data from the Index/Compression files, which it assumes are located in the same place as the data file, but it uses a FileStream internally. So if these data files are in the DFS (Distributed File System), so far I didn't find a way to tell the API to use a stream from the DFS instead of a local file input stream. So basically you have 2 options: a) Copy these files from DFS to the local file system (same as what the Knewton guys did at https://github.com/Knewton/KassandraMRHelper) b) Develop your own API to access the SSTable files directly (my guess is that the Netflix guys probably did it this way; they have a project called Aegisthus (see here: http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html), but it is not open source). 4) About the performance, I am not sure, as sstable2json underneath is using the same Cassandra API, but running in MR gives us some support in scalability, as we can reuse the Hadoop framework for the many benefits it can bring. Yong From: dean.hil...@nrel.gov To: user@cassandra.apache.org Date: Fri, 30 Aug 2013 07:25:09 -0600 Subject: map/reduce performance time and sstable reader…. Has anyone done performance tests on sstable reading vs. M/R? I did a quick test reading all SSTables in an LCS column family of 23 tables and took the average time sstable2json took (to /dev/null to make it faster), which was 7 seconds per table. 
(reading to stdout took 16 seconds per table). This then worked out to an estimate of 12.5 hours, up to 27 hours (from the to-stdout calculation). I am suspecting the map/reduce time may be much worse since there are not as many repeated rows in LCS. Ie. I am wondering if I should just read from SSTables directly instead of using map/reduce? I am about to dig around in the code of M/R and sstable2json to see what each is doing specifically. Thanks, Dean
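The arithmetic behind Dean's estimate can be written down directly; the table count below is a hypothetical value chosen to roughly reproduce the ~12.5 hour figure, not a number from the thread:

```python
# Scale the measured per-table sstable2json time to a full data set.
SECONDS_PER_TABLE_DEVNULL = 7   # measured, output to /dev/null
SECONDS_PER_TABLE_STDOUT = 16   # measured, output to stdout

def total_hours(table_count, seconds_per_table):
    return table_count * seconds_per_table / 3600

tables = 6400  # hypothetical full-data-set table count
print(round(total_hours(tables, SECONDS_PER_TABLE_DEVNULL), 1))  # 12.4
print(round(total_hours(tables, SECONDS_PER_TABLE_STDOUT), 1))   # 28.4
```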
Re: Versioning in cassandra
try the following. -ml

-- put this in a file and run using 'cqlsh -f file'
DROP KEYSPACE latest;
CREATE KEYSPACE latest WITH replication = { 'class': 'SimpleStrategy', 'replication_factor' : 1 };
USE latest;
CREATE TABLE file (
    parentid text,                    -- row_key, same for each version
    id text,                          -- column_key, same for each version
    contenttype map<timestamp, text>, -- differs by version; version is the key to the map
    PRIMARY KEY (parentid, id)
);
update file set contenttype = contenttype + {'2011-03-04':'pdf1'} where parentid = 'd1' and id = 'f1';
update file set contenttype = contenttype + {'2011-03-05':'pdf2'} where parentid = 'd1' and id = 'f1';
update file set contenttype = contenttype + {'2011-03-04':'pdf3'} where parentid = 'd1' and id = 'f2';
update file set contenttype = contenttype + {'2011-03-05':'pdf4'} where parentid = 'd1' and id = 'f2';
select * from file where parentid = 'd1';
-- returns:
-- parentid | id | contenttype
-- ---------+----+-------------
-- d1       | f1 | {'2011-03-04 00:00:00-0500': 'pdf1', '2011-03-05 00:00:00-0500': 'pdf2'}
-- d1       | f2 | {'2011-03-04 00:00:00-0500': 'pdf3', '2011-03-05 00:00:00-0500': 'pdf4'}
-- use an app to pop off the latest version from the map
-- map other varying fields using the same technique as used for contenttype

On Tue, Sep 3, 2013 at 2:31 PM, Vivek Mishra mishra.v...@gmail.com wrote: create table file(id text, parentid text, contenttype text, version timestamp, descr text, name text, PRIMARY KEY(id, version)) WITH CLUSTERING ORDER BY (version DESC); insert into file (id, parentid, version, contenttype, descr, name) values ('f2', 'd1', '2011-03-06', 'pdf', 'f2 file', 'file1'); insert into file (id, parentid, version, contenttype, descr, name) values ('f2', 'd1', '2011-03-05', 'pdf', 'f2 file', 'file1'); insert into file (id, parentid, version, contenttype, descr, name) values ('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1'); insert into file (id, parentid, version, contenttype, descr, name) values ('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 
'file1'); create index on file(parentid); select * from file where id='f1' and parentid='d1' limit 1; select * from file where parentid='d1' limit 1; Will it work for you? -Vivek On Tue, Sep 3, 2013 at 11:29 PM, Vivek Mishra mishra.v...@gmail.comwrote: My bad. I did miss out to read latest version part. -Vivek On Tue, Sep 3, 2013 at 11:20 PM, dawood abdullah muhammed.daw...@gmail.com wrote: I have tried with both the options creating secondary index and also tried adding parentid to primary key, but I am getting all the files with parentid 'yyy', what I want is the latest version of file with the combination of parentid, fileid. Say below are the records inserted in the file table: insert into file (id, parentid, version, contenttype, description, name) values ('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 'file1'); insert into file (id, parentid, version, contenttype, description, name) values ('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1'); insert into file (id, parentid, version, contenttype, description, name) values ('f2', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1'); insert into file (id, parentid, version, contenttype, description, name) values ('f2', 'd1', '2011-03-06', 'pdf', 'f1 file', 'file1'); I want to write a query which returns me second and last record and not the first and third record, because for the first and third record there exists a latest version, for the combination of id and parentid. I am confused If at all this is achievable, please suggest. Dawood On Tue, Sep 3, 2013 at 10:58 PM, Vivek Mishra mishra.v...@gmail.comwrote: create secondary index over parentid. OR make it part of clustering key -Vivek On Tue, Sep 3, 2013 at 10:42 PM, dawood abdullah muhammed.daw...@gmail.com wrote: Jan, The solution you gave works spot on, but there is one more requirement I forgot to mention. 
Following is my table structure CREATE TABLE file ( id text, contenttype text, createdby text, createdtime timestamp, description text, name text, parentid text, version timestamp, PRIMARY KEY (id, version) ) WITH CLUSTERING ORDER BY (version DESC); The query (select * from file where id = 'xxx' limit 1;) provided solves the problem of finding the latest version file. But I have one more requirement of finding all the latest version files having parentid say 'yyy'. Please suggest how can this query be achieved. Dawood On Tue, Sep 3, 2013 at 12:43 AM, dawood abdullah muhammed.daw...@gmail.com wrote: In my case version can be timestamp as well. What do you suggest version number to be, do you see any problems if I keep version as counter / timestamp ? On Tue, Sep 3, 2013 at 12:22 AM, Jan Algermissen jan.algermis...@nordsc.com wrote: On 02.09.2013, at 20:44, dawood abdullah
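Michael's script ends with "use an app to pop off the latest version from the map"; a minimal client-side sketch of that step, with a plain dict standing in for the map<timestamp, text> column:

```python
# Pick the newest entry from a version-keyed map column. ISO-formatted
# version keys compare chronologically as strings.
row = {
    "parentid": "d1",
    "id": "f1",
    "contenttype": {"2011-03-04": "pdf1", "2011-03-05": "pdf2"},
}

latest_version = max(row["contenttype"])
latest_value = row["contenttype"][latest_version]
print(latest_version, latest_value)  # 2011-03-05 pdf2
```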
Re: Versioning in cassandra
create table file(id text , parentid text,contenttype text,version timestamp, descr text, name text, PRIMARY KEY(id,version) ) WITH CLUSTERING ORDER BY (version DESC); insert into file (id, parentid, version, contenttype, descr, name) values ('f2', 'd1', '2011-03-06', 'pdf', 'f2 file', 'file1'); insert into file (id, parentid, version, contenttype, descr, name) values ('f2', 'd1', '2011-03-05', 'pdf', 'f2 file', 'file1'); insert into file (id, parentid, version, contenttype, descr, name) values ('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1'); insert into file (id, parentid, version, contenttype, descr, name) values ('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 'file1'); create index on file(parentid); select * from file where id='f1' and parentid='d1' limit 1; select * from file where parentid='d1' limit 1; Will it work for you? -Vivek On Tue, Sep 3, 2013 at 11:29 PM, Vivek Mishra mishra.v...@gmail.com wrote: My bad. I did miss out to read latest version part. -Vivek On Tue, Sep 3, 2013 at 11:20 PM, dawood abdullah muhammed.daw...@gmail.com wrote: I have tried with both the options creating secondary index and also tried adding parentid to primary key, but I am getting all the files with parentid 'yyy', what I want is the latest version of file with the combination of parentid, fileid. 
Say below are the records inserted in the file table: insert into file (id, parentid, version, contenttype, description, name) values ('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 'file1'); insert into file (id, parentid, version, contenttype, description, name) values ('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1'); insert into file (id, parentid, version, contenttype, description, name) values ('f2', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1'); insert into file (id, parentid, version, contenttype, description, name) values ('f2', 'd1', '2011-03-06', 'pdf', 'f1 file', 'file1'); I want to write a query which returns me the second and last record and not the first and third record, because for the first and third record there exists a later version for the combination of id and parentid. I am confused if at all this is achievable, please suggest. Dawood On Tue, Sep 3, 2013 at 10:58 PM, Vivek Mishra mishra.v...@gmail.com wrote: create a secondary index over parentid, OR make it part of the clustering key. -Vivek On Tue, Sep 3, 2013 at 10:42 PM, dawood abdullah muhammed.daw...@gmail.com wrote: Jan, The solution you gave works spot on, but there is one more requirement I forgot to mention. Following is my table structure: CREATE TABLE file ( id text, contenttype text, createdby text, createdtime timestamp, description text, name text, parentid text, version timestamp, PRIMARY KEY (id, version) ) WITH CLUSTERING ORDER BY (version DESC); The query provided (select * from file where id = 'xxx' limit 1;) solves the problem of finding the latest version of a file. But I have one more requirement: finding all the latest-version files having parentid say 'yyy'. Please suggest how this query can be achieved. Dawood On Tue, Sep 3, 2013 at 12:43 AM, dawood abdullah muhammed.daw...@gmail.com wrote: In my case the version can be a timestamp as well. What do you suggest the version number should be; do you see any problems if I keep the version as a counter / timestamp? 
On Tue, Sep 3, 2013 at 12:22 AM, Jan Algermissen jan.algermis...@nordsc.com wrote: On 02.09.2013, at 20:44, dawood abdullah muhammed.daw...@gmail.com wrote: The requirement is like this: I have a column family, say File: create table file(id text primary key, fname text, version int, mimetype text, content text); Say I have a few records inserted; when I modify an existing record (content is modified) a new version needs to be created, as I need to have provision to revert back to any old version whenever required. So, can the version be a timestamp? Or does it need to be an integer? In the former case, make use of C*'s ordering like so: CREATE TABLE file ( file_id text, version timestamp, fname text, PRIMARY KEY (file_id, version) ) WITH CLUSTERING ORDER BY (version DESC); Get the latest file version with select * from file where file_id = 'xxx' limit 1; If it has to be an integer, use counter columns. Jan Regards, Dawood On Mon, Sep 2, 2013 at 10:47 PM, Jan Algermissen jan.algermis...@nordsc.com wrote: Hi Dawood, On 02.09.2013, at 16:36, dawood abdullah muhammed.daw...@gmail.com wrote: Hi, I have a requirement for versioning in Cassandra. Following is my column family definition: create table file_details(id text primary key, fname text, version int, mimetype text); I have a secondary index created on the fname column. Whenever I do an insert for the same 'fname', the version should be incremented. And when I retrieve a row by fname it should return me the latest version
Re: [RELEASE] Apache Cassandra 2.0 released
Thanks for everyone's work on this release! -Jeremiah On Sep 3, 2013, at 8:48 AM, Sylvain Lebresne sylv...@datastax.com wrote: The Cassandra team is very pleased to announce the release of Apache Cassandra version 2.0.0. Cassandra 2.0.0 is a new major release that adds numerous improvements[1,2], including: - Lightweight transactions[4] that offers linearizable consistency. - Experimental Triggers Support[5]. - Numerous enhancements to CQL as well as a new and better version of the native protocol[6]. - Compaction improvements[7] (including a hybrid strategy that combines leveled and size-tiered compaction). - A new faster Thrift Server implementation based on LMAX Disruptor[8]. - Eager retries: avoids query timeout by sending data requests to other replicas if too much time passes on the original request. See the full changelog[1] for more and please make sure to check the release notes[2] for upgrading details. Both source and binary distributions of Cassandra 2.0.0 can be downloaded at: http://cassandra.apache.org/download/ As usual, a debian package is available from the project APT repository[3] (you will need to use the 20x series). The Cassandra team [1]: http://goo.gl/zU4sWv (CHANGES.txt) [2]: http://goo.gl/MrR6Qn (NEWS.txt) [3]: http://wiki.apache.org/cassandra/DebianPackaging [4]: http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0 [5]: http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-0-prototype-triggers-support [6]: http://www.datastax.com/dev/blog/cql-in-cassandra-2-0 [7]: https://issues.apache.org/jira/browse/CASSANDRA-5371 [8]: https://issues.apache.org/jira/browse/CASSANDRA-5582
How to fix host ID collision?
Hello, We have a Cassandra cluster with 5 nodes hosted in Amazon EC2, and I had to restart two of them, so their IPs changed. We use NetworkTopologyStrategy, so I simply updated the IPs in the cassandra-topology.properties file. However, as I understand it, the old IPs remained somewhere in the system keyspace, and now I observe several different exception stack traces in the log files, including:
java.lang.RuntimeException: Host ID collision between active endpoint /new IP and /old IP (id=ab66dd02-96b2-4504-8403-7d066f911698)
    at org.apache.cassandra.locator.TokenMetadata.updateHostId(TokenMetadata.java:229)
    at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:1358)
    at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1228)
    at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:1960)
    at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:837)
    at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:915)
    at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
and
java.lang.AssertionError: Missing host ID for old IP
    at org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:583)
    at org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:552)
    at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:1658)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
nodetool status, executed on the 3 old nodes, shows the old ghost node:
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load     Tokens  Owns   Host ID                               Rack
UN  10.14.128.109  2.8 GB   141     4.1%   32260392-12c2-4f1a-812e-87fd9a960d10  RAC2
UN  10.24.33.187   2.12 GB  258     42.7%  ab66dd02-96b2-4504-8403-7d066f911698  RAC3
UN  10.20.149.165  2.99 GB  251     4.5%   a0792f59-20b1-4017-a7f6-88e0c0d7f86f  RAC1
DN  10.11.73.104   1.07 GB  2       1.0%   null                                  RAC1
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load     Tokens  Owns   Host ID                               Rack
UN  10.34.78.23   2.21 GB  117     0.9%   2acd3766-404d-4cdc-b3e3-7b3b95856f0e  RAC1
UN  10.20.23.171  2.22 GB  255     46.8%  67421e3a-1dfc-48a0-88b3-c6dbd64dc9d8  RAC1
Is it possible to fix those IP collisions? Thanks.
RE: Update-Replace
I have a similar use case but only need to update a portion of the row. We basically perform a single write (with old and new columns) with a very low TTL value for the old columns. From: jan.algermis...@nordsc.com Subject: Update-Replace Date: Fri, 30 Aug 2013 17:35:48 +0200 To: user@cassandra.apache.org Hi, I have a use case where I periodically need to apply updates to a wide row that should replace the whole row. A straight-forward insert/update only replaces values that are present in the executed statement, keeping the remaining data around. Is there a smooth way to do a replace with C*, or do I have to handle this in the application (e.g. doing a delete and then a write, or coming up with a more clever data model)? Jan
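Baskar's pattern - rewrite the row with the superseded columns carrying a short TTL so they expire on their own - can be sketched as CQL generation. The table and column names here are illustrative, not from a real schema, and the statement building is simplified (no escaping or bind markers):

```python
# Build one batch that rewrites a row: stale columns get a short TTL,
# current columns are written without one.
def replace_row_batch(key, stale, current, stale_ttl=60):
    stmts = [
        f"UPDATE wide_row USING TTL {stale_ttl} SET {c} = '{v}' WHERE key = '{key}';"
        for c, v in stale.items()
    ]
    stmts += [
        f"UPDATE wide_row SET {c} = '{v}' WHERE key = '{key}';"
        for c, v in current.items()
    ]
    return "BEGIN BATCH\n  " + "\n  ".join(stmts) + "\nAPPLY BATCH;"

print(replace_row_batch("k1", {"colA": "old"}, {"colB": "new"}))
```

The trade-off is that the stale columns remain readable until the TTL expires, which may or may not be acceptable for a strict "replace".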
RE: List&lt;blob&gt; retrieve performance
I don't know of any. I would check the size of the LIST. If it is taking long, it could just be that the disk read is taking long. Date: Sat, 31 Aug 2013 16:35:22 -0300 Subject: List<blob> retrieve performance From: savio.te...@lupa.inf.ufg.br To: user@cassandra.apache.org I have a column family with this conf:
CREATE TABLE geoms (
  geom_key text PRIMARY KEY,
  part_geom list<blob>,
  the_geom text
) WITH bloom_filter_fp_chance=0.01 AND caching='KEYS_ONLY' AND comment='' AND dclocal_read_repair_chance=0.00 AND gc_grace_seconds=864000 AND read_repair_chance=0.10 AND replicate_on_write='true' AND populate_io_cache_on_flush='false' AND compaction={'class': 'SizeTieredCompactionStrategy'} AND compression={'sstable_compression': 'SnappyCompressor'};
The query select geom_key, the_geom, part_geom from geoms limit 1; runs in 700 ms. When I run the same query without the part_geom attribute (select geom_key, the_geom from geoms limit 1;), the query runs in 5 ms. Is there a performance problem with a list<blob> attribute? Thanks in advance -- Best regards, Sávio S. Teles de Oliveira voice: +55 62 9136 6996 http://br.linkedin.com/in/savioteles MSc student in Computer Science - UFG Software Architect Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG
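Following the reply's suggestion to check the size of the list, it helps to estimate how many bytes the list<blob> column actually ships per row; the chunk count and size below are made-up illustrative values, not measurements from this table:

```python
# Estimate the per-row payload of a list<blob> column.
chunks = 200            # hypothetical number of blobs in part_geom
chunk_size = 64 * 1024  # hypothetical 64 KiB per blob

total_bytes = chunks * chunk_size
print(total_bytes / (1024 * 1024), "MiB per row")  # 12.5 MiB per row
```

A multi-megabyte collection read would easily account for a 700 ms vs. 5 ms difference on its own.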
Re: Versioning in cassandra
I use the technique described in my previous message to handle millions of messages and their versions. Actually, I use timeuuids instead of timestamps, as they have more 'uniqueness'. Also, I index my maps by a timeuuid that is the complement (based on a future date) of the current timeuuid. Since maps are kept sorted by key, this means I can just pop off the first one to get the most recent. The downside of this approach is that you get more stuff returned to you from Cassandra than you need. To mitigate that, I queue a job to examine and correct the situation if, upon doing a read, the number of versions for a particular key is higher than some threshold, e.g. 50. There are many ways to approach this problem. Our actual implementation proceeds to another level, as we also have replicas of versions. This happens because we process important transactions in parallel and can expect up to 9 replicas of each version. We journal them all and use them for reporting latencies in our processing pipelines as well as for replay when we need to recover application state. Regards, Michael On Tue, Sep 3, 2013 at 3:15 PM, Laing, Michael michael.la...@nytimes.com wrote: try the following. 
-ml

-- put this in a file and run using 'cqlsh -f file'
DROP KEYSPACE latest;
CREATE KEYSPACE latest WITH replication = { 'class': 'SimpleStrategy', 'replication_factor' : 1 };
USE latest;
CREATE TABLE file (
    parentid text,                    -- row_key, same for each version
    id text,                          -- column_key, same for each version
    contenttype map<timestamp, text>, -- differs by version; version is the key to the map
    PRIMARY KEY (parentid, id)
);
update file set contenttype = contenttype + {'2011-03-04':'pdf1'} where parentid = 'd1' and id = 'f1';
update file set contenttype = contenttype + {'2011-03-05':'pdf2'} where parentid = 'd1' and id = 'f1';
update file set contenttype = contenttype + {'2011-03-04':'pdf3'} where parentid = 'd1' and id = 'f2';
update file set contenttype = contenttype + {'2011-03-05':'pdf4'} where parentid = 'd1' and id = 'f2';
select * from file where parentid = 'd1';
-- returns:
-- parentid | id | contenttype
-- ---------+----+-------------
-- d1       | f1 | {'2011-03-04 00:00:00-0500': 'pdf1', '2011-03-05 00:00:00-0500': 'pdf2'}
-- d1       | f2 | {'2011-03-04 00:00:00-0500': 'pdf3', '2011-03-05 00:00:00-0500': 'pdf4'}
-- use an app to pop off the latest version from the map
-- map other varying fields using the same technique as used for contenttype

On Tue, Sep 3, 2013 at 2:31 PM, Vivek Mishra mishra.v...@gmail.com wrote: create table file(id text, parentid text, contenttype text, version timestamp, descr text, name text, PRIMARY KEY(id, version)) WITH CLUSTERING ORDER BY (version DESC); insert into file (id, parentid, version, contenttype, descr, name) values ('f2', 'd1', '2011-03-06', 'pdf', 'f2 file', 'file1'); insert into file (id, parentid, version, contenttype, descr, name) values ('f2', 'd1', '2011-03-05', 'pdf', 'f2 file', 'file1'); insert into file (id, parentid, version, contenttype, descr, name) values ('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1'); insert into file (id, parentid, version, contenttype, descr, name) values ('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 'file1'); create index on 
file(parentid); select * from file where id='f1' and parentid='d1' limit 1; select * from file where parentid='d1' limit 1; Will it work for you? -Vivek On Tue, Sep 3, 2013 at 11:29 PM, Vivek Mishra mishra.v...@gmail.com wrote: My bad. I did miss out on reading the latest-version part. -Vivek On Tue, Sep 3, 2013 at 11:20 PM, dawood abdullah muhammed.daw...@gmail.com wrote: I have tried with both the options: creating a secondary index, and also adding parentid to the primary key, but I am getting all the files with parentid 'yyy'; what I want is the latest version of the file for the combination of parentid, fileid. Say below are the records inserted in the file table: insert into file (id, parentid, version, contenttype, description, name) values ('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 'file1'); insert into file (id, parentid, version, contenttype, description, name) values ('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1'); insert into file (id, parentid, version, contenttype, description, name) values ('f2', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1'); insert into file (id, parentid, version, contenttype, description, name) values ('f2', 'd1', '2011-03-06', 'pdf', 'f1 file', 'file1'); I want to write a query which returns me the second and last record and not the first and third record, because for the first and third record there exists a later version for the combination of id and parentid. I am confused if at all this is achievable, please suggest. Dawood On Tue, Sep 3, 2013 at 10:58 PM, Vivek Mishra mishra.v...@gmail.com wrote:
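Michael's complement trick - keying the map by (future anchor - timestamp) so that ascending key order puts the newest version first - can be sketched with plain integers standing in for the timeuuids he uses in practice; the anchor value here is a hypothetical choice:

```python
# Hypothetical far-future anchor: 2100-01-01 UTC as a Unix timestamp.
ANCHOR = 4102444800

def complement_key(ts):
    """Map key that makes newer entries sort first in ascending key order."""
    return ANCHOR - int(ts)

older = complement_key(1378200000)  # an earlier write
newer = complement_key(1378300000)  # a later write
assert newer < older  # ascending sort now puts the newest entry first
print(newer, older)
```

Since Cassandra keeps map entries sorted by key, the client can then simply take the first entry to get the most recent version.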
Re: Update-Replace
Baskar, On 03.09.2013, at 23:11, Baskar Duraikannu baskar.duraika...@outlook.com wrote: I have a similar use case but only need to update portion of the row. We basically perform single write (with old and new columns) with very low value of ttl for old columns. I found out that using bound statements with java-driver works quite well for this case because the fields with a ? in the prepared statement but without a bound value will be automatically set to null - hence removed. So this actually automagically does what you/I want. See https://groups.google.com/a/lists.datastax.com/d/msg/java-driver-user/APfnKNTXuvE/gBeCk37jgRAJ Jan From: jan.algermis...@nordsc.com Subject: Update-Replace Date: Fri, 30 Aug 2013 17:35:48 +0200 To: user@cassandra.apache.org Hi, I have a use case, where I periodically need to apply updates to a wide row that should replace the whole row. The straight-forward insert/update only replace values that are present in the executed statement, keeping remaining data around. Is there a smooth way to do a replace with C* or do I have to handle this by the application (e.g. doing delete and then write or coming up with a more clever data model)? Jan
cqlsh error after enabling encryption
Hi All,

After enabling encryption on our Cassandra 1.2.8 nodes, we are receiving the error "Connection error: TSocket read 0 bytes" while attempting to use cqlsh to talk to the ring. I've followed the docs over at http://www.datastax.com/documentation/cassandra/1.2/webhelp/cassandra/security/secureCqlshSSL_t.html but can't seem to figure out why this isn't working. Inter-node communication seems to be working properly, since nodetool status shows our nodes as up, but the cqlsh client is unable to talk to a single node or any node in the cluster (specifying the IP in .cqlshrc or on the CLI) for some reason. I'm providing the applicable config file entries below for review. Any insight or suggestions would be greatly appreciated! :)

My ~/.cqlshrc file:

[connection]
hostname = 127.0.0.1
port = 9160
factory = cqlshlib.ssl.ssl_transport_factory

[ssl]
certfile = /etc/cassandra/conf/cassandra_client.crt
validate = true ## Optional, true by default.

[certfiles]
## Optional section, overrides the default certfile in the [ssl] section.
192.168.1.3 = ~/keys/cassandra01.cert
192.168.1.4 = ~/keys/cassandra02.cert

Our cassandra.yaml config blocks:

…snip…
server_encryption_options:
    internode_encryption: all
    keystore: /etc/cassandra/conf/.keystore
    keystore_password: yeah-right
    truststore: /etc/cassandra/conf/.truststore
    truststore_password: yeah-right
    # More advanced defaults below:
    # protocol: TLS
    # algorithm: SunX509
    # store_type: JKS
    # cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA]
    # require_client_auth: false

# enable or disable client/server encryption.
client_encryption_options:
    enabled: true
    keystore: /etc/cassandra/conf/.keystore
    keystore_password: yeah-right
    # require_client_auth: false
    # Set truststore and truststore_password if require_client_auth is true
    # truststore: conf/.truststore
    # truststore_password: cassandra
    # More advanced defaults below:
    protocol: TLS
    algorithm: SunX509
    store_type: JKS
    cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA]
…snip...

Thanks,
-David Laube
Re[2]: How to fix host ID collision?
Thanks a lot for the quick reply. Should I run nodetool repair on all nodes before or after that? Also, the documentation mentions that the auto_bootstrap setting applies only to non-seed nodes. Currently I have specified all nodes as seeds; should I then remove the nodes with new IPs from the seed list?

Tuesday, September 3, 2013, 14:08 -07:00, from Robert Coli rc...@eventbrite.com:
On Tue, Sep 3, 2013 at 2:01 PM, Renat Gilfanov gren...@mail.ru wrote:
> We have a Cassandra cluster with 5 nodes hosted in Amazon EC2, and I had to restart two of them, so their IPs changed. We use NetworkTopologyStrategy, so I simply updated the IPs in the cassandra-topology.properties file.

Set auto_bootstrap: false in the conf file and restart the node to change the IP address for a node.

=Rob
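Rob's suggestion, as a hypothetical cassandra.yaml excerpt for the restarted node; since auto_bootstrap is ignored on seed nodes, a node going through this procedure should not list itself in its own seeds (the IPs below are placeholders):

```yaml
# Sketch of the relevant cassandra.yaml settings, not a full config
auto_bootstrap: false    # node already owns its data; skip bootstrap
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # current IPs of the other nodes; this node omitted
          - seeds: "10.0.0.1,10.0.0.2"
```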
Re: List&lt;blob&gt; retrieve performance
The list is null.

2013/9/3 Baskar Duraikannu baskar.duraika...@outlook.com
I don't know of any. I would check the size of the LIST. If it is taking long, it could be just that the disk read is taking long.

--
Date: Sat, 31 Aug 2013 16:35:22 -0300
Subject: List&lt;blob&gt; retrieve performance
From: savio.te...@lupa.inf.ufg.br
To: user@cassandra.apache.org

I have a column family with this conf:

CREATE TABLE geoms (
    geom_key text PRIMARY KEY,
    part_geom list&lt;blob&gt;,
    the_geom text
) WITH bloom_filter_fp_chance=0.01 AND
    caching='KEYS_ONLY' AND
    comment='' AND
    dclocal_read_repair_chance=0.00 AND
    gc_grace_seconds=864000 AND
    read_repair_chance=0.10 AND
    replicate_on_write='true' AND
    populate_io_cache_on_flush='false' AND
    compaction={'class': 'SizeTieredCompactionStrategy'} AND
    compression={'sstable_compression': 'SnappyCompressor'};

I run this query

select geom_key, the_geom, part_geom from geoms limit 1;

in 700 ms. When I run the same query without the part_geom attribute (select geom_key, the_geom from geoms limit 1;), the query runs in 5 ms.

Is there a performance problem with a list&lt;blob&gt; attribute?

Thanks in advance.

--
Best regards,
Sávio S. Teles de Oliveira
voice: +55 62 9136 6996
http://br.linkedin.com/in/savioteles
MSc candidate in Computer Science - UFG
Software Architect
Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG
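One likely explanation: a CQL collection is always read and returned in full along with its row, so a large list&lt;blob&gt; is paid for on every SELECT that touches it. A possible workaround (a sketch with assumed names and types, not the schema from the thread) is to give each part its own clustering row, so the parts are only read when asked for:

```cql
-- Hypothetical split of part_geom out of the geoms table
CREATE TABLE geom_parts (
    geom_key text,
    part_no  int,
    part     blob,
    PRIMARY KEY (geom_key, part_no)
);

-- Fetch the parts only when actually needed:
SELECT part FROM geom_parts WHERE geom_key = 'g1';
```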
Fwd: {kundera-discuss} Kundera 2.7 released
fyi.

-- Forwarded message --
From: Vivek Mishra vivek.mis...@impetus.co.in
Date: Wed, Sep 4, 2013 at 6:15 AM
Subject: {kundera-discuss} Kundera 2.7 released
To: kundera-disc...@googlegroups.com

Hi All,

We are happy to announce the release of Kundera 2.7.

Kundera is a JPA 2.0-compliant object-datastore mapping library for NoSQL datastores. The idea behind Kundera is to make working with NoSQL databases drop-dead simple and fun. It currently supports Cassandra, HBase, MongoDB, Redis, OracleNoSQL, Neo4j, ElasticSearch and relational databases.

Major changes:
1) Support for pagination over MongoDB.
2) Added ElasticSearch as a datastore and fallback indexing mechanism.

Github bug fixes:
https://github.com/impetus-opensource/Kundera/issues/234
https://github.com/impetus-opensource/Kundera/issues/215
https://github.com/impetus-opensource/Kundera/issues/201
https://github.com/impetus-opensource/Kundera/issues/333
https://github.com/impetus-opensource/Kundera/issues/362
https://github.com/impetus-opensource/Kundera/issues/350
https://github.com/impetus-opensource/Kundera/issues/365

How to download:
To download, use or contribute to Kundera, visit: http://github.com/impetus-opensource/Kundera
The latest released tag version is 2.7.

Kundera maven libraries are now available at: https://oss.sonatype.org/content/repositories/releases/com/impetus

Sample code and examples for using Kundera can be found here: https://github.com/impetus-opensource/Kundera/tree/trunk/kundera-tests

Survey/feedback: http://www.surveymonkey.com/s/BMB9PWG

Thank you all for your contributions and for using Kundera!

Sincerely,
Kundera Team