Re: I don't understand paging through a table by primary key.
Hello Kevin

Can you be more specific on the issue you're facing? What is the table design? What kind of query are you doing?

Regards

On Fri, May 30, 2014 at 7:10 AM, Kevin Burton bur...@spinn3r.com wrote: I'm trying to grok this but I can't figure it out in the CQL world. I'd like to efficiently page through a table via primary key. This way I only involve one node at a time and the reads on disk are contiguous. I would have assumed it was a combination of pk and ORDER BY, but that doesn't seem to work.

-- Founder/CEO Spinn3r.com Location: *San Francisco, CA* Skype: *burtonator* blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts http://spinn3r.com War is peace. Freedom is slavery. Ignorance is strength. Corporations are people.
Insert failed after some time in cassandra with timeout
I have installed Cassandra 2.0 on a CentOS 6.5 server, and while testing with simple records everything works fine. Now I have to upload 600 billion rows. When I used COPY in cqlsh it failed after 5 minutes with an rpc timeout, with approximately 0.2 million rows inserted. I then opted for pycassa, parsed the CSV, and tried to import using INSERT commands; after every 10K records we close the connection and open a new one. But after around 60K records it failed with a timeout. My debug trace shows the following while the server is not accepting inserts; without any activity it's still busy:

DEBUG [OptionalTasks:1] 2014-05-30 04:34:16,305 MeteredFlusher.java (line 41) Currently flushing 269227480 bytes of 2047868928 max
DEBUG [OptionalTasks:1] 2014-05-30 04:34:17,306 MeteredFlusher.java (line 41) Currently flushing 269227480 bytes of 2047868928 max
DEBUG [OptionalTasks:1] 2014-05-30 04:34:18,306 MeteredFlusher.java (line 41) Currently flushing 269227480 bytes of 2047868928 max
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,012 ColumnFamilyStore.java (line 298) retryPolicy for schema_triggers is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,012 ColumnFamilyStore.java (line 298) retryPolicy for compaction_history is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,012 ColumnFamilyStore.java (line 298) retryPolicy for batchlog is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,012 ColumnFamilyStore.java (line 298) retryPolicy for sstable_activity is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,012 ColumnFamilyStore.java (line 298) retryPolicy for peer_events is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,012 ColumnFamilyStore.java (line 298) retryPolicy for compactions_in_progress is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,013 ColumnFamilyStore.java (line 298) retryPolicy for hints is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,013 ColumnFamilyStore.java (line 298) retryPolicy for schema_keyspaces is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,013 ColumnFamilyStore.java (line 298) retryPolicy for range_xfers is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,013 ColumnFamilyStore.java (line 298) retryPolicy for schema_columnfamilies is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,013 ColumnFamilyStore.java (line 298) retryPolicy for NodeIdInfo is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,013 ColumnFamilyStore.java (line 298) retryPolicy for paxos is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,013 ColumnFamilyStore.java (line 298) retryPolicy for schema_columns is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,014 ColumnFamilyStore.java (line 298) retryPolicy for IndexInfo is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,014 ColumnFamilyStore.java (line 298) retryPolicy for peers is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,014 ColumnFamilyStore.java (line 298) retryPolicy for local is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,307 MeteredFlusher.java (line 41) Currently flushing 269227480 bytes of 2047868928 max
DEBUG [OptionalTasks:1] 2014-05-30 04:34:20,307 MeteredFlusher.java (line 41) Currently flushing 269227480 bytes of 2047868928 max
DEBUG [OptionalTasks:1] 2014-05-30 04:34:20,716 ColumnFamilyStore.java (line 298) retryPolicy for backup_calls is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:20,716 ColumnFamilyStore.java (line 298) retryPolicy for sessions is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:20,716 ColumnFamilyStore.java (line 298) retryPolicy for events is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:21,308 MeteredFlusher.java (line 41) Currently flushing 269

When I try to insert records it shows an error like this in the debug log:

DEBUG [OptionalTasks:1] 2014-05-30 04:34:40,717 ColumnFamilyStore.java (line 298) retryPolicy for backup_calls is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:40,717 ColumnFamilyStore.java (line 298) retryPolicy for sessions is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:40,718 ColumnFamilyStore.java (line 298) retryPolicy for events is 0.99
DEBUG [Thrift:24] 2014-05-30 04:34:40,775 CustomTThreadPoolServer.java (line 211) Thrift transport error occurred during processing of message.
org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
    at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362) at
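Not a fix for the server-side flushing pressure shown above, but a client-side pattern that often helps with bulk loads like this: throttle into small batches and back off on timeouts rather than hammering a node that is busy flushing. A hedged sketch — the `execute` callback stands in for whatever driver call you use, and all names here are made up for illustration:

```python
import time

def bulk_insert(rows, execute, batch_size=500, max_retries=5):
    """Insert rows in small batches; on a timeout, back off and retry the
    batch. Retrying re-executes rows already written, which is fine for
    idempotent INSERTs."""
    inserted = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                for row in batch:
                    execute(row)
                inserted += len(batch)
                break
            except TimeoutError:
                # Give the server room to flush memtables before retrying.
                time.sleep(2 ** attempt * 0.1)
        else:
            raise RuntimeError(f"batch at offset {start} failed after retries")
    return inserted

# Fake server that times out every 7th call, to exercise the retry path.
calls = {"n": 0}
def flaky_execute(row):
    calls["n"] += 1
    if calls["n"] % 7 == 0:
        raise TimeoutError
print(bulk_insert(list(range(20)), flaky_execute, batch_size=5))
```

In a real loader, `execute` would wrap a prepared-statement call, and the backoff constants would be tuned to how fast the node drains its flush queue.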
Re: Multi-DC Environment Question
Thanks for your responses; Ben, thanks for the link. Basically you sort of confirmed that if down_time > max_hint_window_in_ms, the only way to bring DC1 up-to-date is anti-entropy repair. Read consistency level is irrelevant to the problem I described, as I am reading at LOCAL_QUORUM. In this situation I lost whatever data -if any- had not been transferred across to DC2 before DC1 went down; that is understandable. Also, read repair does not help either, as we assumed that down_time > max_hint_window_in_ms. Please correct me if I am wrong.

I think I could better understand how that works if I knew the answers to the following questions:

1. What is the output of nodetool status when a cluster spans 2 DCs? Will I be able to see ALL nodes irrespective of the DC they belong to?
2. How are tokens assigned when adding a 2nd DC? Is the range -2^63 to 2^63-1 for each DC, or is it -2^63 to 2^63-1 for the entire cluster? (I think the latter is correct)
3. Does the coordinator store 1 hint irrespective of how many replicas happen to be down at the time, and also irrespective of DC2 being down in the scenario I described above? (I think the answer is yes according to the presentation you sent me, but I would like someone to confirm that)

Thank you in advance,
Vasilis

On Fri, May 30, 2014 at 3:13 AM, Ben Bromhead b...@instaclustr.com wrote: Short answer: If time elapsed > max_hint_window_in_ms then hints will stop being created. You will need to rely on your read consistency level, read repair and anti-entropy repair operations to restore consistency. Long answer: http://www.slideshare.net/jasedbrown/understanding-antientropy-in-cassandra

Ben Bromhead Instaclustr | www.instaclustr.com | @instaclustr | +61 415 936 359

On 30 May 2014, at 8:40 am, Tupshin Harper tups...@tupshin.com wrote: When one node or DC is down, coordinator nodes being written through will notice this fact and store hints (hinted handoff is the mechanism), and those hints are used to send the data that was not able to be replicated initially. http://www.datastax.com/dev/blog/modern-hinted-handoff -Tupshin

On May 29, 2014 6:22 PM, Vasileios Vlachos vasileiosvlac...@gmail.com wrote: Hello All, We have plans to add a second DC to our live Cassandra environment. Currently RF=3 and we read and write at QUORUM. After adding DC2 we are going to be reading and writing at LOCAL_QUORUM. If my understanding is correct, when a client sends a write request, if the consistency level is satisfied on DC1 (that is RF/2+1), success is returned to the client and DC2 will eventually get the data as well. The assumption behind this is that the client always connects to DC1 for reads and writes, and that there is a site-to-site VPN between DC1 and DC2. Therefore, DC1 will almost always return success before DC2 (actually I don't know if it is possible for DC2 to be more up-to-date than DC1 with this setup...). Now imagine DC1 loses connectivity and the client fails over to DC2. Everything should work fine after that, with the only difference that DC2 will now be handling the requests directly from the client. After some time, say after max_hint_window_in_ms, DC1 comes back up. My question is how do I bring DC1 up to speed with DC2, which is now more up-to-date? Will that require a nodetool repair on the DC1 nodes? Also, what is the answer when the outage is < max_hint_window_in_ms instead? Thanks in advance! Vasilis -- Kind Regards, Vasileios Vlachos
Impact of Bloom filter false positive rate
Hi,

I'm currently working on some properties of Bloom filters and this is the first time I have used Cassandra, so I'm sorry if my question seems dumb. Basically, I am trying to see the impact of the Bloom filter's false positive rate on performance. My test case is:

1. I create a table with: create table bloom.test_fp (t text primary key, d text) with bloom_filter_fp_chance = fp_rate
2. I fill this table with 10 rows using random data
3. I force the creation of an SSTable by flushing the Memtable with nodetool flush
4. I measure the time required to perform 100 basic queries like select * from bloom.test_fp where t = random_data

Surprisingly, there is not much difference depending on the false positive rate selected. I suspect some caches interfere. Is there a way for me to see the impact on performance without using a large dataset?

Thanks.
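Not a substitute for measuring on a real cluster, but the effect being tested here can be illustrated offline: the false-positive rate is exactly the fraction of wasted SSTable reads for keys that do not exist. A self-contained sketch with a toy filter (this is not Cassandra's implementation; the sizes and key counts are arbitrary):

```python
import hashlib
import random
import string

class BloomFilter:
    """Tiny Bloom filter: m bits, k hash positions derived from SHA-256."""
    def __init__(self, m, k):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        return all(self.bits[p] for p in self._positions(item))

random.seed(42)
rand = lambda: "".join(random.choices(string.ascii_lowercase, k=12))
stored = {rand() for _ in range(1000)}

# A small filter (high fp rate) vs. a larger one (low fp rate).
small, large = BloomFilter(4000, 3), BloomFilter(20000, 5)
for key in stored:
    small.add(key)
    large.add(key)

# Query 1000 keys that are NOT in the table; every "might contain" answer
# here is a false positive, i.e. a wasted trip to the SSTable.
misses = [rand() for _ in range(1000)]
wasted_small = sum(small.might_contain(k) for k in misses)
wasted_large = sum(large.might_contain(k) for k in misses)
print(wasted_small, wasted_large)
```

The point it makes about the test above: the fp rate only costs you on lookups for *absent* keys, so queries for keys that exist (or that are served from a cache) will not show a difference.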
Re: Write Failed, COPY on cqlsh with rpc_timeout
Sharaf,

Do the logs show any errors while you're trying to insert into Cassandra?

-- Patricia Gorla @patriciagorla Consultant Apache Cassandra Consulting http://www.thelastpickle.com
Re: Managing truststores with inter-node encryption
It appears that only adding the CA certificate to the truststore is sufficient for this. On Thu, May 22, 2014 at 10:05 AM, Jeremy Jongsma jer...@barchart.com wrote: The docs say that each node needs every other node's certificate in its local truststore: http://www.datastax.com/documentation/cassandra/1.2/cassandra/security/secureSSLCertificates_t.html This seems like a bit of a headache for adding nodes to a cluster. How do others deal with this? 1) If I am self-signing the client certificates (with puppetmaster), is it enough that the truststore just contain the CA certificate used to sign them? This is the typical PKI mechanism for verifying trust, so I am hoping it works here. 2) If not, can I use the same certificate for every node? If so, what is the downside? I'm mainly concerned with encryption over public internet links, not node identity verification.
Re: Anyone using Astyanax in production besides Netflix itself?
Is anyone already using Astyanax in a production cluster? Which C* version do you use with Astyanax?
Reading Cassandra Data From Pig/Hadoop
I am reasonably experienced with Hadoop and Pig but less so with Cassandra. I have been banging my head against the wall, as all the documentation assumes I know something...

I am using Apache's tarball of Cassandra 1.something and I see that there are some example Pig scripts and a shell script to run them with the Cassandra jars. What I don't understand is how you tell the Pig script which machine the Cassandra cluster talks to. You only specify the keyspace, right - which roughly corresponds to the database/table, but not which cluster. Can you tell what I have missed? Do the Hadoop nodes HAVE to be on the same machines as the Cassandra nodes?

I am using CQL storage, I think, e.g.:

-- CqlStorage
libdata = LOAD 'cql://libdata/libout' USING CqlStorage();
book_by_mail = FILTER libdata BY C_OUT_TY == 'BM';

etc etc. Thanks all...
Re: Anyone using Astyanax in production besides Netflix itself?
My team uses Astyanax for 3 different C* clusters in production. We're on C* 1.2.xx. It works well for our requirements - we don't use CQL, mostly just time-series data. But cutting this short: most people who ask about Astyanax get redirected to their user group (https://groups.google.com/forum/#!forum/astyanax-cassandra-client) - it might be best to get the info you're looking for from there, even though there is probably a large number of overlapping followers.

On Fri, May 30, 2014 at 10:41 AM, user 01 user...@gmail.com wrote: Anyone who's already using Astyanax in production cluster? What C* do you use with Astyanax?
Re: I don't understand paging through a table by primary key.
On Thu, May 29, 2014 at 10:10 PM, Kevin Burton bur...@spinn3r.com wrote:

> I'd like to efficiently page through a table via primary key. This way I only involve one node at a time and the reads on disk are

This is only true if you use an Ordered Partitioner, which almost no one does?

> I would have assumed it was a combination of pk and order by but that doesn't seem to work.

http://wiki.apache.org/cassandra/FAQ#iter_world

I don't personally use 2.0 at this time, so I have no idea how good or bad the answer from this FAQ is. There are also ways to iter world in Thrift. Here's the pre-CQL FAQ page: http://wiki.apache.org/cassandra/FAQ?action=recall&rev=148#iter_world

=Rob
Re: Reading Cassandra Data From Pig/Hadoop
To specify your Cassandra cluster, you only need to define one node. In your profile or batch command, set and export these variables:

export PIG_HOME=<path to Pig install>
export PIG_INITIAL_ADDRESS=localhost
export PIG_RPC_PORT=9160
# the partitioner must match your cassandra partitioner
export PIG_PARTITIONER=org.apache.cassandra.dht.Murmur3Partitioner

http://www.schappet.com/pig_cassandra_bulk_load/

—Jimmy

On May 30, 2014, at 11:50 AM, Alex McLintock a...@owal.co.uk wrote: I am reasonably experienced with Hadoop and Pig but less so with Cassandra. [...]
Re: I don't understand paging through a table by primary key.
I think what you want is a “clustering column”. When you model your data, you specify “partition columns”, which are synonymous with the old Thrift-style “keys”, and clustering columns. When creating your PRIMARY KEY, you specify the partition column first; each subsequent column in the primary key is a clustering column. These columns determine how the data in that partition is stored on disk. For instance, if I was storing time-series events for URLs I might do something like this:

PRIMARY KEY(url, event_time)

This means that all events for a given URL will be stored contiguously, in order, on the same node. This allows the following type of query:

SELECT * FROM events WHERE url = 'http://devdazed.com' AND event_time > '2014-01-01' AND event_time < '2014-01-07';

Make sense?

On May 30, 2014 at 1:10:51 AM, Kevin Burton (bur...@spinn3r.com) wrote: I'm trying to grok this but I can't figure it out in CQL world. I'd like to efficiently page through a table via primary key. [...]
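The on-disk layout described above can be mimicked in a few lines: rows within a partition are kept sorted by the clustering column, so a range predicate becomes one contiguous slice. A toy sketch (this is not Cassandra code; the table, URL, and dates are illustrative):

```python
from bisect import bisect_left, bisect_right

# One partition per URL; rows kept sorted by the clustering column
# (event_time), mimicking how cells are laid out within a partition.
partition = {}  # url -> sorted list of (event_time, payload)

def insert(url, event_time, payload):
    rows = partition.setdefault(url, [])
    rows.insert(bisect_left(rows, (event_time,)), (event_time, payload))

def range_query(url, start, end):
    rows = partition.get(url, [])
    lo = bisect_left(rows, (start,))
    hi = bisect_right(rows, (end, chr(0x10FFFF)))
    return rows[lo:hi]  # a contiguous slice: one seek, sequential read

insert("http://devdazed.com", "2014-01-03", "a")
insert("http://devdazed.com", "2014-01-09", "b")
insert("http://devdazed.com", "2014-01-05", "c")
print(range_query("http://devdazed.com", "2014-01-01", "2014-01-07"))
# [('2014-01-03', 'a'), ('2014-01-05', 'c')]
```

The slice is why such queries are cheap, and also why the predicate must be on the clustering column of a single partition: across partitions there is no such ordering.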
Re: I don't understand paging through a table by primary key.
The specific issue is I have a fairly large table, which is immutable, and I need to get it in a form where it can be downloaded, page by page, via an API. This would involve reading the whole table. I'd like to page through it in key order to read the rows efficiently and minimize random reads. It's slightly more complicated than that, in that it's a log-structured table… basically holding the equivalent of Apache logs. I need to read these out by time and give them to API callers.

On Fri, May 30, 2014 at 12:21 AM, DuyHai Doan doanduy...@gmail.com wrote: Hello Kevin Can you be more specific on the issue you're facing? What is the table design? What kind of query are you doing? [...]
Re: I don't understand paging through a table by primary key.
Then the data model you chose is incorrect. As Rob Coli mentioned, you cannot page through partitions in order unless you are using an ordered partitioner. Your only option is to store the data differently. When using Cassandra you have to remember to “model your queries, not your data”. You can only page the entire table by using the TOKEN keyword, and this is not efficient.

On May 30, 2014 at 1:17:37 PM, Kevin Burton (bur...@spinn3r.com) wrote: The specific issue is I have a fairly large table, which is immutable, and I need to get it in a form where it can be downloaded, page by page, via an API. [...]
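For what it's worth, the TOKEN-based full-table walk can be simulated to see why it visits every row exactly once even though the keys look unordered. A hedged sketch using MD5 as a stand-in for Murmur3 (in real CQL this corresponds to repeating SELECT ... WHERE token(pk) > token(?) LIMIT n with the last key seen):

```python
import hashlib

def token(key):
    # Stand-in for Murmur3: any stable hash gives the same effect —
    # rows are ordered by token, not by key.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big", signed=True)

table = {f"user{i}": i for i in range(100)}

def page_by_token(last_token, page_size):
    # Inefficient resort on every call; a real cluster keeps this order on disk.
    rows = sorted(table.items(), key=lambda kv: token(kv[0]))
    return [(k, v) for k, v in rows if token(k) > last_token][:page_size]

# Full scan: repeatedly ask for rows with token > last token seen.
seen, last = [], -(2**63)
while True:
    page = page_by_token(last, 10)
    if not page:
        break
    seen.extend(page)
    last = token(page[-1][0])
print(len(seen))  # 100: every row visited exactly once
```

It works, but as noted above it is a scatter across the whole token range, not a contiguous read, which is why it is not efficient for serving time-ordered pages.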
Re: Reading Cassandra Data From Pig/Hadoop
There's a pig-with-cassandra script somewhere you should be using. It adds the jars, etc. One issue is that you need to call REGISTER on the .jars from your pig scripts. Honestly, someone should write an example Pig setup with modern Hadoop, all the right REGISTER commands, real UPDATE queries encoded, and explain the whole thing. It took me like 2 days to get working and there are also gotchas in your pig scripts. And the fact that the output from CQL is not encoded in tuples but the input must be is insane and maddening and VERY VERY VERY prone to error.

On Fri, May 30, 2014 at 10:10 AM, James Schappet jschap...@gmail.com wrote: To specify your cassandra cluster, you only need to define one node. [...]
RE: Write Failed, COPY on cqlsh with rpc_timeout
Dear Patricia,

Here is the trace of the error for your reference. Another thing is that it is a single-node server only.

The keyspace is created using:

CREATE KEYSPACE mykeyspace WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 1 };

The table is created using:

CREATE TABLE details (
    id bigint PRIMARY KEY,
    fname text,
    lname text,
    username text,
    address text,
    address_alt text,
    cell text,
    landline text,
    office text
) WITH COMPACT STORAGE;

DEBUG [OptionalTasks:1] 2014-05-30 04:34:40,717 ColumnFamilyStore.java (line 298) retryPolicy for backup_calls is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:40,717 ColumnFamilyStore.java (line 298) retryPolicy for sessions is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:40,718 ColumnFamilyStore.java (line 298) retryPolicy for events is 0.99
DEBUG [Thrift:24] 2014-05-30 04:34:40,775 CustomTThreadPoolServer.java (line 211) Thrift transport error occurred during processing of message.
org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
    at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:284)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:191)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
    at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:201)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
DEBUG [Thrift:19] 2014-05-30 04:34:40,775 CustomTThreadPoolServer.java (line 211) Thrift transport error occurred during processing of message.
org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
    at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:284)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:191)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
    at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:201)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
DEBUG [Thrift:21] 2014-05-30 04:34:40,775 CustomTThreadPoolServer.java (line 211) Thrift transport error occurred during processing of message.
org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
    at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:284)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:191)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
    at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:201)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
DEBUG [Thrift:1] 2014-05-30 04:34:40,775 CustomTThreadPoolServer.java (line 211) Thrift transport error occurred during processing of message.
org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129) at
Cassandra Summit 2014 - San Francisco, CA
This year’s Cassandra Summit will be held on September 10th and 11th at The Westin St. Francis in San Francisco, CA. We invite you to submit your talk, register for free tickets, enroll in our 1-day Cassandra training (early bird pricing ends May 31st), and apply for a seat at Cassandra Summit Bootcamp (a technical Cassandra workshop on Sept. 12th and 13th, taught by 6 active committers).

Proposals should be submitted via the following form before EOD May 31st: http://goo.gl/bwkGXn
Registration and event details can be found here: http://goo.gl/F6JCOA

Thanks, and we hope to see you there!

Brady Gentile Community Manager DataStax 480.735.1133
Re: I don't understand paging through a table by primary key.
Hello Kevin

One possible data model:

CREATE TABLE myLog(
    day int, // day formatted as yyyyMMdd
    date timeuuid,
    log_message text,
    PRIMARY KEY(day, date)
);

For each day, you can query paging by date (timeuuid format):

SELECT log_message FROM myLog WHERE day = 20140530 AND date > ... LIMIT xxx;

Of course, you need some client-side code to move from one day to another. If the log volume for one day is too huge and risks creating an ultra-wide row, you can increase the partitioning resolution and take the hour as partition key. In this case you would have:

CREATE TABLE myLog(
    hour int, // hour formatted as yyyyMMddHH
    date timeuuid,
    log_message text,
    PRIMARY KEY(hour, date)
);

On Fri, May 30, 2014 at 7:20 PM, Russell Bradberry rbradbe...@gmail.com wrote: Then the data model you chose is incorrect. As Rob Coli mentioned, you can not page through partitions that are ordered unless you are using an ordered partitioner. Your only option is to store the data differently. [...]
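The client-side code for moving from one day bucket to the next can be sketched roughly like this. An in-memory dict stands in for the myLog table; in practice each bucket's rows would come from a SELECT on that partition:

```python
from datetime import date, timedelta

def day_buckets(start, end):
    """Yield yyyyMMdd partition keys covering [start, end]."""
    d = start
    while d <= end:
        yield int(d.strftime("%Y%m%d"))
        d += timedelta(days=1)

# Stand-in for the myLog table: day bucket -> time-ordered log rows.
log_table = {
    20140529: ["boot", "warn"],
    20140530: ["error", "restart", "ok"],
}

def read_logs(start, end, page_size):
    """Page across day partitions, one partition at a time."""
    for bucket in day_buckets(start, end):
        rows = log_table.get(bucket, [])
        for i in range(0, len(rows), page_size):
            yield bucket, rows[i:i + page_size]

for bucket, page in read_logs(date(2014, 5, 29), date(2014, 5, 30), 2):
    print(bucket, page)
```

Each page touches a single partition, so reads stay contiguous on one node; switching to hour buckets only changes the key format and the loop step.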
Shouldn't cqlsh have an option for no formatting and no headers?
I do this all the time with mysql… dump some database table to an output file so that I can use it in a script. But cqlsh insists on formatting the output. There should be an option for no headers and no whitespace formatting of the results. I mean, I can work around it for now… but it's not going to be fun to always post-process the output.
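Until such an option exists, the post-processing can at least be scripted once. A throwaway sketch that strips cqlsh's box formatting into tab-separated values (assumes the default output style and no '|' characters inside values):

```python
def cqlsh_to_tsv(text, header=False):
    """Keep cqlsh data rows (and optionally the header row), split on '|',
    trim the padding, and join cells with tabs."""
    out = []
    for line in text.splitlines():
        line = line.rstrip()
        if not line or set(line) <= {"-", "+"} or line.startswith("("):
            continue  # separator rules, blank lines, the "(n rows)" footer
        cells = [c.strip() for c in line.split("|")]
        out.append("\t".join(cells))
    return out if header else out[1:]

sample = """ id | name
----+------
  1 |  foo
  2 |  bar

(2 rows)"""
print(cqlsh_to_tsv(sample))  # ['1\tfoo', '2\tbar']
```

Piping `cqlsh -e "SELECT ..."` through a filter like this gets you the script-friendly output; COPY (mentioned in the reply below this in the thread) is the heavier-duty route.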
Re: Shouldn't cqlsh have an option for no formatting and no headers?
cqlsh isn’t designed for dumping data. I think you want COPY: http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/copy_r.html

On May 30, 2014 at 2:32:24 PM, Kevin Burton (bur...@spinn3r.com) wrote: I do this all the time with mysql… dump some database table to an output file so that I can use it in a script. But cqlsh insists on formatting the output. [...]
Re: backend query of a Cassandra db
There are a few ways you can do this; it really depends on your preferences, e.g. whether to have a separate cluster or use the same nodes.

1. If you have DSE, they have Hadoop/Hive integrated, or you can use the open-source Hive handler by Tuplejump: https://github.com/tuplejump/cash
2. Spark/Shark: using Tuplejump Calliope and Cash (http://tuplejump.github.io/calliope/, https://github.com/tuplejump/cash). You can refer to Brian O'Neill's blog posts here: http://brianoneill.blogspot.com/2014/03/shark-on-cassandra-w-cash-interrogating.html, http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html
3. PrestoDB: http://prestodb.io/

Thanks
Bobby

On May 30, 2014, at 12:09 PM, cbert...@libero.it cbert...@libero.it wrote: Hello, I have a working cluster of Cassandra that performs very well on a high-traffic web application. Now I need to build a backend web application to query Cassandra on many non-indexed columns ... what is the best way to do that? Apache Hive? Pig? Thanks
Re: Managing truststores with inter-node encryption
Java SSL sockets need to be able to build a chain of trust, so having either a node's public cert or the root cert in the truststore works (as you found out).

To get Cassandra to use cipher suites > 128 bit you will need to install the JCE unlimited strength jurisdiction policy files. You will know if you aren't using them because there will be a bunch of warnings quickly filling up your logs.

Note that Java's SSL implementation does not check certificate revocation lists by default, though as you are not using inter-node encryption for authentication and identification it's no big deal.

Ben

On 31/05/2014 1:04 AM, Jeremy Jongsma jer...@barchart.com wrote: It appears that only adding the CA certificate to the truststore is sufficient for this. [...]
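For reference, importing just the signing CA into each node's truststore looks something like this with keytool (the paths, alias, and password here are hypothetical placeholders):

```shell
# Add the signing CA to the node's truststore; any node cert signed by
# this CA will then chain to a trusted root.
keytool -importcert -noprompt \
    -alias puppet-ca \
    -file /etc/cassandra/certs/ca.pem \
    -keystore /etc/cassandra/certs/truststore.jks \
    -storepass changeit
```

With the CA in place, adding a node means issuing it a signed cert; no truststore changes are needed on the existing nodes.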