Re: I don't understand paging through a table by primary key.

2014-05-30 Thread DuyHai Doan
Hello Kevin

 Can you be more specific on the issue you're facing ? What is the table
design ? What kind of query are you doing ?

 Regards


On Fri, May 30, 2014 at 7:10 AM, Kevin Burton bur...@spinn3r.com wrote:

 I'm trying to grok this but I can't figure it out in CQL world.

 I'd like to efficiently page through a table via primary key.

 This way I only involve one node at a time and the reads on disk are
 contiguous.

 I would have assumed it was a combination of  pk and order by but that
 doesn't seem to work.

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are
 people.




Insert failed after some time in cassandra with timeout‏

2014-05-30 Thread Sharaf Ali
I have installed Cassandra 2.0 on a CentOS 6.5 server, and while testing
simple records everything is working fine. Now I have to upload 600
billion rows. When I use COPY in cqlsh it fails after about 5 minutes with
an rpc timeout, with approximately 0.2 million rows inserted. I then opted for
pycassa, parsed the csv, and tried to import using INSERT commands; after
every 10K records we close the connection and open a new one, but after
around 60K records it still fails with a timeout.
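
For context, a COPY import of the kind described looks roughly like this
(keyspace, table, column, and file names are placeholders):

COPY mykeyspace.mytable (id, col1, col2) FROM 'data.csv' WITH HEADER = false;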

My debug trace shows the following while the server is not accepting inserts;
without any other activity it still appears busy.

DEBUG [OptionalTasks:1] 2014-05-30 04:34:16,305 MeteredFlusher.java (line 41) Currently flushing 269227480 bytes of 2047868928 max
DEBUG [OptionalTasks:1] 2014-05-30 04:34:17,306 MeteredFlusher.java (line 41) Currently flushing 269227480 bytes of 2047868928 max
DEBUG [OptionalTasks:1] 2014-05-30 04:34:18,306 MeteredFlusher.java (line 41) Currently flushing 269227480 bytes of 2047868928 max
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,012 ColumnFamilyStore.java (line 298) retryPolicy for schema_triggers is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,012 ColumnFamilyStore.java (line 298) retryPolicy for compaction_history is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,012 ColumnFamilyStore.java (line 298) retryPolicy for batchlog is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,012 ColumnFamilyStore.java (line 298) retryPolicy for sstable_activity is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,012 ColumnFamilyStore.java (line 298) retryPolicy for peer_events is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,012 ColumnFamilyStore.java (line 298) retryPolicy for compactions_in_progress is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,013 ColumnFamilyStore.java (line 298) retryPolicy for hints is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,013 ColumnFamilyStore.java (line 298) retryPolicy for schema_keyspaces is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,013 ColumnFamilyStore.java (line 298) retryPolicy for range_xfers is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,013 ColumnFamilyStore.java (line 298) retryPolicy for schema_columnfamilies is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,013 ColumnFamilyStore.java (line 298) retryPolicy for NodeIdInfo is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,013 ColumnFamilyStore.java (line 298) retryPolicy for paxos is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,013 ColumnFamilyStore.java (line 298) retryPolicy for schema_columns is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,014 ColumnFamilyStore.java (line 298) retryPolicy for IndexInfo is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,014 ColumnFamilyStore.java (line 298) retryPolicy for peers is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,014 ColumnFamilyStore.java (line 298) retryPolicy for local is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:19,307 MeteredFlusher.java (line 41) Currently flushing 269227480 bytes of 2047868928 max
DEBUG [OptionalTasks:1] 2014-05-30 04:34:20,307 MeteredFlusher.java (line 41) Currently flushing 269227480 bytes of 2047868928 max
DEBUG [OptionalTasks:1] 2014-05-30 04:34:20,716 ColumnFamilyStore.java (line 298) retryPolicy for backup_calls is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:20,716 ColumnFamilyStore.java (line 298) retryPolicy for sessions is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:20,716 ColumnFamilyStore.java (line 298) retryPolicy for events is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:21,308 MeteredFlusher.java (line 41) Currently flushing 269

When I try to insert records, the debug log shows errors like this:

DEBUG [OptionalTasks:1] 2014-05-30 04:34:40,717 ColumnFamilyStore.java (line 298) retryPolicy for backup_calls is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:40,717 ColumnFamilyStore.java (line 298) retryPolicy for sessions is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:40,718 ColumnFamilyStore.java (line 298) retryPolicy for events is 0.99
DEBUG [Thrift:24] 2014-05-30 04:34:40,775 CustomTThreadPoolServer.java (line 211) Thrift transport error occurred during processing of message.
org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
at 

Re: Multi-DC Environment Question

2014-05-30 Thread Vasileios Vlachos
Thanks for your responses, Ben thanks for the link.

Basically you sort of confirmed that if down_time > max_hint_window_in_ms,
the only way to bring DC1 up-to-date is anti-entropy repair. Read
consistency level is irrelevant to the problem I described as I am reading
at LOCAL_QUORUM. In this situation I lost whatever data (if any) had not been
transferred across to DC2 before DC1 went down; that is understandable.
Also, read repair does not help either, as we assumed that down_time >
max_hint_window_in_ms. Please correct me if I am wrong.

I think I could better understand how that works if I knew the answers to
the following questions:
1. What is the output of nodetool status when a cluster spans across 2 DCs?
Will I be able to see ALL nodes irrespective of the DC they belong to?
2. How are tokens assigned when adding a 2nd DC? Is the range -2^63 to
2^63-1 for each DC, or is it -2^63 to 2^63-1 for the entire cluster? (I
think the latter is correct)
3. Does the coordinator store 1 hint irrespective of how many replicas
happen to be down at the time and also irrespective of DC2 being down in
the scenario I described above? (I think the answer is according to the
presentation you sent me, but I would like someone to confirm that)

Thank you in advance,

Vasilis


On Fri, May 30, 2014 at 3:13 AM, Ben Bromhead b...@instaclustr.com wrote:

 Short answer:

 If time elapsed > max_hint_window_in_ms then hints will stop being
 created. You will need to rely on your read consistency level, read repair
 and anti-entropy repair operations to restore consistency.

 Long answer:

 http://www.slideshare.net/jasedbrown/understanding-antientropy-in-cassandra

 Ben Bromhead
 Instaclustr | www.instaclustr.com | @instaclustr
 http://twitter.com/instaclustr | +61 415 936 359

 On 30 May 2014, at 8:40 am, Tupshin Harper tups...@tupshin.com wrote:

 When one node or DC is down, coordinator nodes being written through will
 notice this fact and store hints (hinted handoff is the mechanism),  and
 those hints are used to send the data that was not able to be replicated
 initially.

 http://www.datastax.com/dev/blog/modern-hinted-handoff

 -Tupshin
 On May 29, 2014 6:22 PM, Vasileios Vlachos vasileiosvlac...@gmail.com
 wrote:

  Hello All,

 We have plans to add a second DC to our live Cassandra environment.
 Currently RF=3 and we read and write at QUORUM. After adding DC2 we are
 going to be reading and writing at LOCAL_QUORUM.
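 For reference, a keyspace replicated across both DCs would be defined along
 these lines (the keyspace name and per-DC replication factors are illustrative):

 ALTER KEYSPACE my_keyspace WITH REPLICATION =
   { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3, 'DC2' : 3 };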

 If my understanding is correct, when a client sends a write request, if
 the consistency level is satisfied on DC1 (that is RF/2+1), success is
 returned to the client and DC2 will eventually get the data as well. The
 assumption behind this is that the client always connects to DC1 for
 reads and writes, and that there is a site-to-site VPN between DC1 and
 DC2. Therefore, DC1 will almost always return success before DC2 (actually
 I don't know if it is possible for DC2 to be more up-to-date than DC1 with
 this setup...).

 Now imagine DC1 loses connectivity and the client fails over to DC2.
 Everything should work fine after that, with the only difference that DC2
 will be now handling the requests directly from the client. After some
 time, say after max_hint_window_in_ms, DC1 comes back up. My question is
 how do I bring DC1 up to speed with DC2 which is now more up-to-date? Will
 that require a nodetool repair on DC1 nodes? Also, what is the answer
 when the outage is < max_hint_window_in_ms instead?

 Thanks in advance!

 Vasilis

 --
 Kind Regards,

 Vasileios Vlachos





Impact of Bloom filter false positive rate

2014-05-30 Thread Thomas GERBET
Hi,

I'm currently working on some properties of Bloom filters and this is the
first time I have used Cassandra, so I'm sorry if my question seems dumb.
Basically, I am trying to see the impact of the Bloom filter false positive
rate on performance.

My test case is:
1. I create a table with (a concrete example is sketched after this list):
create table bloom.test_fp (t text primary key, d text) with
bloom_filter_fp_chance = fp_rate
2. I fill this table with 10 rows using random data
3. I force the creation of an SSTable by flushing the Memtable with nodetool flush
4. I measure the time required to perform 100 basic queries like select
* from bloom.test_fp where t = random_data
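
As a concrete sketch of steps 1 and 4 (the 0.1 false positive rate and the key
value are just example placeholders):

create table bloom.test_fp (t text primary key, d text) with
bloom_filter_fp_chance = 0.1;

select * from bloom.test_fp where t = 'some_random_key';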

Surprisingly, there is not much difference depending on the false positive
rate selected. I suspect some caches are interfering.

Is there a way for me to see the impact on performance without using a large
dataset?

Thanks.


Re: Write Failed, COPY on cqlsh with rpc_timeout‏

2014-05-30 Thread Patricia Gorla
Sharaf,

Do the logs show any errors while you're trying to insert into Cassandra?
-- 
Patricia Gorla
@patriciagorla

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com http://thelastpickle.com


Re: Managing truststores with inter-node encryption

2014-05-30 Thread Jeremy Jongsma
It appears that only adding the CA certificate to the truststore is
sufficient for this.


On Thu, May 22, 2014 at 10:05 AM, Jeremy Jongsma jer...@barchart.com
wrote:

 The docs say that each node needs every other node's certificate in its
 local truststore:


 http://www.datastax.com/documentation/cassandra/1.2/cassandra/security/secureSSLCertificates_t.html

 This seems like a bit of a headache for adding nodes to a cluster. How do
 others deal with this?

 1) If I am self-signing the client certificates (with puppetmaster), is it
 enough that the truststore just contain the CA certificate used to sign
 them? This is the typical PKI mechanism for verifying trust, so I am hoping
 it works here.

 2) If not, can I use the same certificate for every node? If so, what is
 the downside? I'm mainly concerned with encryption over public internet
 links, not node identity verification.





Re: Anyone using Astyanax in production besides Netflix itself?

2014-05-30 Thread user 01
Is anyone already using Astyanax in a production cluster? What C* version do
you use with Astyanax?


Reading Cassandra Data From Pig/Hadoop

2014-05-30 Thread Alex McLintock
I am reasonably experienced with Hadoop and Pig but less so with Cassandra.
I have been banging my head against the wall as all the documentation
assumes I know something...

I am using Apache's tarball of Cassandra 1.something and I see that there
are some example pig scripts and a shell script to run them with the
cassandra jars.

What I don't understand is how you tell the pig script which machine the
cassandra cluster is on. You only specify the keyspace, right - which
roughly corresponds to the database/table, but not which cluster.

Can you tell me what I have missed? Do the hadoop nodes HAVE to be on the
same machines as the Cassandra nodes?

I am using CQL storage I think.

eg

-- CqlStorage
libdata = LOAD 'cql://libdata/libout' USING CqlStorage();
book_by_mail = FILTER libdata BY C_OUT_TY == 'BM';
etc etc


Thanks all...


Re: Anyone using Astyanax in production besides Netflix itself?

2014-05-30 Thread Jeremy Powell
My team uses Astyanax for 3 different C* clusters in production. We're on
C* 1.2.xx. It works well for our requirements - we don't use CQL, mostly just
time-series data.

But to cut this short, most people who ask about Astyanax get redirected
to its user group (
https://groups.google.com/forum/#!forum/astyanax-cassandra-client) - it
might be best to get the info you're looking for from there, even though
there is probably a large number of overlapping followers.


On Fri, May 30, 2014 at 10:41 AM, user 01 user...@gmail.com wrote:

 Anyone who's already using Astyanax in production cluster? What C* do you
 use with Astyanax ?



Re: I don't understand paging through a table by primary key.

2014-05-30 Thread Robert Coli
On Thu, May 29, 2014 at 10:10 PM, Kevin Burton bur...@spinn3r.com wrote:

 I'd like to efficiently page through a table via primary key.

 This way I only involve one node at a time and the reads on disk are


This is only true if you use an Ordered Partitioner, which almost no one
does?


 I would have assumed it was a combination of  pk and order by but that
 doesn't seem to work.


http://wiki.apache.org/cassandra/FAQ#iter_world

I don't personally use 2.0 at this time, so I have no idea how good or bad
the answer from this FAQ is.

There are also ways to iter world in thrift. Here's the pre-CQL FAQ page :

http://wiki.apache.org/cassandra/FAQ?action=recall&rev=148#iter_world

=Rob


Re: Reading Cassandra Data From Pig/Hadoop

2014-05-30 Thread James Schappet
To specify your cassandra cluster, you only need to define one node:

In your profile or batch script, set and export these variables:

export PIG_HOME=PATH TO PIG INSTALL

export PIG_INITIAL_ADDRESS=localhost

export PIG_RPC_PORT=9160

# the partitioner must match your cassandra partitioner

export PIG_PARTITIONER=org.apache.cassandra.dht.Murmur3Partitioner




http://www.schappet.com/pig_cassandra_bulk_load/

—Jimmy 



On May 30, 2014, at 11:50 AM, Alex McLintock a...@owal.co.uk wrote:

 I am reasonably experienced with Hadoop and Pig but less so with Cassandra. I 
 have been banging my head against the wall as all the documentation assumes I 
 know something...
 
 I am using Apache's tarball of Cassandra 1.something and I see that there are 
 some example pig scripts and a shell script to run them with the cassandra 
 jars. 
 
 What I don't understand is how you tell the pig script which machine the 
 cassandra cluster talks to. You only specify the keyspace right - which 
 roughly corresponds to the database/table, but not which cluster. 
 
 Can you tell what I have missed? Does the hadoop nodes HAVE to be on the same 
 machines as the Cassandra nodes?
 
 I am using CQL storage I think.
 
 eg
 
 
 -- CqlStorage
 libdata = LOAD 'cql://libdata/libout' USING CqlStorage();
 
 book_by_mail = FILTER libdata BY C_OUT_TY == 'BM';
 
 etc etc
 
 
 
 Thanks all...
 
 
 
 



Re: I don't understand paging through a table by primary key.

2014-05-30 Thread Russell Bradberry
I think what you want is a “clustering column”. When you model your data, you 
specify “partition columns” (synonymous with the old thrift-style “keys”) and 
clustering columns. When creating your PRIMARY KEY, you specify the 
partition column first; each subsequent column in the primary key is a 
clustering column. These columns determine how the data in that partition is 
stored on disk. 

For instance, if I was storing time-series events for URLs I might do something 
like this:

PRIMARY KEY(url, event_time)

This means that all events for a given URL will be stored contiguously in order 
on the same node.
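
As a sketch, a full table definition of this shape could be (the non-key column 
is illustrative):

CREATE TABLE events (
  url text,
  event_time timestamp,
  payload text,
  PRIMARY KEY (url, event_time)
);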

This allows the following type of query:

SELECT * FROM events WHERE url = 'http://devdazed.com' AND event_time > 
'2014-01-01' AND event_time < '2014-01-07';

Make sense?



On May 30, 2014 at 1:10:51 AM, Kevin Burton (bur...@spinn3r.com) wrote:

I'm trying to grok this but I can't figure it out in CQL world.

I'd like to efficiently page through a table via primary key.

This way I only involve one node at a time and the reads on disk are 
contiguous.  

I would have assumed it was a combination of  pk and order by but that doesn't 
seem to work.

--
Founder/CEO Spinn3r.com
Location: San Francisco, CA
Skype: burtonator
blog: http://burtonator.wordpress.com
… or check out my Google+ profile

War is peace. Freedom is slavery. Ignorance is strength. Corporations are 
people.

Re: I don't understand paging through a table by primary key.

2014-05-30 Thread Kevin Burton
The specific issue is I have a fairly large table, which is immutable, and
I need to get it in a form where it can be downloaded, page by page, via an
API.

This would involve reading the whole table.

I'd like to page through it by key order to efficiently read the rows to
minimize random reads.

It's slightly more complicated than that, in that it's a log-structured
table… basically holding the equivalent of apache logs. I need to read
these out by time and give them to API callers.


On Fri, May 30, 2014 at 12:21 AM, DuyHai Doan doanduy...@gmail.com wrote:

 Hello Kevin

  Can you be more specific on the issue you're facing ? What is the table
 design ? What kind of query are you doing ?

  Regards


 On Fri, May 30, 2014 at 7:10 AM, Kevin Burton bur...@spinn3r.com wrote:

 I'm trying to grok this but I can't figure it out in CQL world.

 I'd like to efficiently page through a table via primary key.

 This way I only involve one node at a time and the reads on disk are
 contiguous.

 I would have assumed it was a combination of  pk and order by but that
 doesn't seem to work.

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are
 people.





-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: I don't understand paging through a table by primary key.

2014-05-30 Thread Russell Bradberry
Then the data model you chose is incorrect. As Rob Coli mentioned, you cannot 
page through partitions in order unless you are using an ordered 
partitioner. Your only option is to store the data differently. When using 
Cassandra you have to remember to “model your queries, not your data”. You can 
only page the entire table by using the TOKEN keyword, and this is not 
efficient.
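
For completeness, the TOKEN-based full-table scan mentioned above looks roughly 
like this (table and column names are placeholders, and last_id_seen stands for 
the last partition key returned by the previous page):

-- first page
SELECT id, data FROM mytable LIMIT 1000;
-- subsequent pages
SELECT id, data FROM mytable WHERE token(id) > token(last_id_seen) LIMIT 1000;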



On May 30, 2014 at 1:17:37 PM, Kevin Burton (bur...@spinn3r.com) wrote:

The specific issue is I have a fairly large table, which is immutable, and I 
need to get it in a form where it can be downloaded, page by page, via an API.

This would involve reading the whole table.  

I'd like to page through it by key order to efficiently read the rows to 
minimize random reads.

It's slightly more complicated then that in that it's a log structured table… 
basically holding the equivalent of apache logs..  I need to read these out by 
time and give them to API callers.


On Fri, May 30, 2014 at 12:21 AM, DuyHai Doan doanduy...@gmail.com wrote:
Hello Kevin

 Can you be more specific on the issue you're facing ? What is the table design 
? What kind of query are you doing ?

 Regards


On Fri, May 30, 2014 at 7:10 AM, Kevin Burton bur...@spinn3r.com wrote:
I'm trying to grok this but I can't figure it out in CQL world.

I'd like to efficiently page through a table via primary key.

This way I only involve one node at a time and the reads on disk are 
contiguous.  

I would have assumed it was a combination of  pk and order by but that doesn't 
seem to work.

--
Founder/CEO Spinn3r.com
Location: San Francisco, CA
Skype: burtonator
blog: http://burtonator.wordpress.com
… or check out my Google+ profile

War is peace. Freedom is slavery. Ignorance is strength. Corporations are 
people.




--
Founder/CEO Spinn3r.com
Location: San Francisco, CA
Skype: burtonator
blog: http://burtonator.wordpress.com
… or check out my Google+ profile

War is peace. Freedom is slavery. Ignorance is strength. Corporations are 
people.

Re: Reading Cassandra Data From Pig/Hadoop

2014-05-30 Thread Kevin Burton
There's a pig-with-cassandra script somewhere you should be using.

It adds the jars, etc.

One issue is that you need to call REGISTER on the .jars from your pig
scripts.

Honestly, someone should write an example pig setup with modern hadoop, all
the right register commands, real UPDATE queries encoded, and explain the
whole thing.

Took me like 2 days to get working and there are also gotchas in your pig
scripts.

And the fact that the output from cql is not encoded in tuples but the
input must be is insane and maddening and VERY VERY VERY prone to error.




On Fri, May 30, 2014 at 10:10 AM, James Schappet jschap...@gmail.com
wrote:

 To specify your cassandra cluster, you only need to define one node:

 In you profile or batch command set and export these variables:

 export PIG_HOME=PATH TO PIG INSTALL

 export PIG_INITIAL_ADDRESS=localhost

 export PIG_RPC_PORT=9160

 # the partitioner must match your cassandra partitioner
 export PIG_PARTITIONER=org.apache.cassandra.dht.Murmur3Partitioner




 http://www.schappet.com/pig_cassandra_bulk_load/

 —Jimmy



 On May 30, 2014, at 11:50 AM, Alex McLintock a...@owal.co.uk wrote:

 I am reasonably experienced with Hadoop and Pig but less so with
 Cassandra. I have been banging my head against the wall as all the
 documentation assumes I know something...

 I am using Apache's tarball of Cassandra 1.something and I see that there
 are some example pig scripts and a shell script to run them with the
 cassandra jars.

 What I don't understand is how you tell the pig script which machine the
 cassandra cluster talks to. You only specify the keyspace right - which
 roughly corresponds to the database/table, but not which cluster.

 Can you tell what I have missed? Does the hadoop nodes HAVE to be on the
 same machines as the Cassandra nodes?

 I am using CQL storage I think.

 eg



 -- CqlStorage
 libdata = LOAD 'cql://libdata/libout' USING CqlStorage();

 book_by_mail = FILTER libdata BY C_OUT_TY == 'BM';

 etc etc



 Thanks all...









-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


A

2014-05-30 Thread Ruchir Jha


Sent from my iPhone


RE: Write Failed, COPY on cqlsh with rpc_timeout‏

2014-05-30 Thread Sharaf Ali
Dear Patricia,

Here is the error trace for your reference. One other thing is that this is a 
single-node server only.
The keyspace is created using:

CREATE KEYSPACE mykeyspace WITH REPLICATION = { 'class' : 
'NetworkTopologyStrategy', 'datacenter1' : 1};

The table is created using:
CREATE TABLE details (
 id bigint PRIMARY KEY,
 fname text,
 lname text,
 username text,
 address text,
 address_alt text,
 cell text,
 landline text,
 office text
) WITH COMPACT STORAGE;
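
For reference, an individual insert against this schema has the following shape 
(the values are made up):

INSERT INTO details (id, fname, lname, username, address, address_alt, cell, landline, office)
VALUES (1, 'Jane', 'Doe', 'jdoe', '1 Example St', '', '555-0100', '555-0101', '555-0102');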

DEBUG [OptionalTasks:1] 2014-05-30 04:34:40,717 ColumnFamilyStore.java (line 
298) retryPolicy for backup_calls is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:40,717 ColumnFamilyStore.java (line 
298) retryPolicy for sessions is 0.99
DEBUG [OptionalTasks:1] 2014-05-30 04:34:40,718 ColumnFamilyStore.java (line 
298) retryPolicy for events is 0.99
DEBUG [Thrift:24] 2014-05-30 04:34:40,775 CustomTThreadPoolServer.java (line 
211) Thrift transport error occurred during processing of message.
org.apache.thrift.transport.TTransportException
at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at 
org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
at 
org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at 
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
at 
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:284)
at 
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:191)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
at 
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:201)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
DEBUG [Thrift:19] 2014-05-30 04:34:40,775 CustomTThreadPoolServer.java (line 
211) Thrift transport error occurred during processing of message.
org.apache.thrift.transport.TTransportException
at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at 
org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
at 
org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at 
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
at 
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:284)
at 
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:191)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
at 
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:201)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
DEBUG [Thrift:21] 2014-05-30 04:34:40,775 CustomTThreadPoolServer.java (line 
211) Thrift transport error occurred during processing of message.
org.apache.thrift.transport.TTransportException
at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at 
org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
at 
org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at 
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
at 
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:284)
at 
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:191)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
at 
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:201)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
DEBUG [Thrift:1] 2014-05-30 04:34:40,775 CustomTThreadPoolServer.java (line 
211) Thrift transport error occurred during processing of message.
org.apache.thrift.transport.TTransportException
at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at 
org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
at 

Cassandra Summit 2014 - San Francisco, CA

2014-05-30 Thread Brady Gentile
This year’s Cassandra Summit will be held on September 10th and 11th at The 
Westin St. Francis in San Francisco, CA.

We invite you to submit your talk, register for free tickets, enroll in our 
1-day Cassandra training (early bird pricing ends May 31st), and apply for a 
seat at Cassandra Summit Bootcamp (a technical Cassandra workshop on Sept. 12th 
and 13th, taught by 6 active committers).

Proposals should be submitted via the following form, before EOD May 31st: 
http://goo.gl/bwkGXn

Registration and event details can be found here: http://goo.gl/F6JCOA

Thanks, and we hope to see you there!

Brady Gentile
Community Manager
DataStax
480.735.1133







Re: I don't understand paging through a table by primary key.

2014-05-30 Thread DuyHai Doan
Hello Kevin

One possible data model:

CREATE TABLE myLog(
  day int,            // day formatted as yyyyMMdd
  date timeuuid,
  log_message text,
  PRIMARY KEY(day, date)
);

 For each day, you can page by date (timeuuid format): SELECT
log_message FROM myLog WHERE day = 20140530 AND date > ... LIMIT xxx;
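
 A concrete paging query (assuming Cassandra 2.0.1+, where the minTimeuuid()
function is available; the day, timestamp, and LIMIT values are placeholders):

SELECT log_message FROM myLog
WHERE day = 20140530
  AND date > minTimeuuid('2014-05-30 12:00:00')
LIMIT 100;

 To resume exactly where the previous page stopped, the last timeuuid returned
can be used directly in place of the minTimeuuid() call.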

 Of course, you need some client-side code to move from one day to another.
If the log volume for one day is too huge and risks creating an ultra-wide
row, you can increase the partitioning resolution and use the hour as the
partition key. In this case you would have:

CREATE TABLE myLog(
  hour int,           // hour formatted as yyyyMMddHH
  date timeuuid,
  log_message text,
  PRIMARY KEY(hour, date)
);





On Fri, May 30, 2014 at 7:20 PM, Russell Bradberry rbradbe...@gmail.com
wrote:

 Then the data model you chose is incorrect.  As Rob Coli mentioned, you
 can not page through partitions that are ordered unless you are using an
 ordered partitioner.  Your only option is to store the data differently.
  When using Cassandra you have to remember to “model your queries, not your
 data”.  You can only page the entire table by using the TOKEN keyword, and
 this is not efficient.



 On May 30, 2014 at 1:17:37 PM, Kevin Burton (bur...@spinn3r.com) wrote:

 The specific issue is I have a fairly large table, which is immutable, and
 I need to get it in a form where it can be downloaded, page by page, via an
 API.

 This would involve reading the whole table.

 I'd like to page through it by key order to efficiently read the rows to
 minimize random reads.

 It's slightly more complicated then that in that it's a log structured
 table… basically holding the equivalent of apache logs..  I need to read
 these out by time and give them to API callers.


 On Fri, May 30, 2014 at 12:21 AM, DuyHai Doan doanduy...@gmail.com
 wrote:

 Hello Kevin

  Can you be more specific on the issue you're facing ? What is the table
 design ? What kind of query are you doing ?

  Regards


 On Fri, May 30, 2014 at 7:10 AM, Kevin Burton bur...@spinn3r.com wrote:

 I'm trying to grok this but I can't figure it out in CQL world.

 I'd like to efficiently page through a table via primary key.

 This way I only involve one node at a time and the reads on disk are
 contiguous.

 I would have assumed it was a combination of  pk and order by but that
 doesn't seem to work.

 --

  Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
  http://spinn3r.com
  War is peace. Freedom is slavery. Ignorance is strength. Corporations
 are people.





 --

  Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
  http://spinn3r.com
  War is peace. Freedom is slavery. Ignorance is strength. Corporations are
 people.




Shouldn't cqlsh have an option for no formatting and no headers?

2014-05-30 Thread Kevin Burton
I do this all the time with mysql… dump some database table to an output
file so that I can use it in a script.

but cqlsh insists on formatting the output.

there should be an option for no headers and no whitespace formatting of
the results.

I mean I can work around it for now… but it's not going to be fun to always
post process the output.

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: Shouldn't cqlsh have an option for no formatting and no headers?

2014-05-30 Thread Russell Bradberry
cqlsh isn’t designed for dumping data. I think you want COPY 
http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/copy_r.html
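
As a sketch (keyspace, table, column, and file names are placeholders), an 
export without headers looks like:

COPY mykeyspace.mytable (col1, col2) TO 'output.csv' WITH DELIMITER = ',' AND HEADER = false;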



On May 30, 2014 at 2:32:24 PM, Kevin Burton (bur...@spinn3r.com) wrote:

I do this all the time with mysql… dump some database table to an output file 
so that I can use it in a script.

but cqlsh insists on formatting the output.

there should be an option for no headers and no whitespace formatting of the 
results.

I mean I can work around it for now… but it's not going to be fun to always 
post process the output.

--
Founder/CEO Spinn3r.com
Location: San Francisco, CA
Skype: burtonator
blog: http://burtonator.wordpress.com
… or check out my Google+ profile

War is peace. Freedom is slavery. Ignorance is strength. Corporations are 
people.

Re: backend query of a Cassandra db

2014-05-30 Thread Bobby Chowdary

There are a few ways you can do this; it really depends on whether you prefer a 
separate cluster or to use the same nodes, etc.

1. If you have DSE, it has Hadoop/Hive integrated, or you can use the open-source 
Hive handler by Tuplejump: https://github.com/tuplejump/cash
2. Spark/Shark: using Tuplejump Calliope and Cash 
(http://tuplejump.github.io/calliope/ , https://github.com/tuplejump/cash) you 
can refer to Brian O'Neill's blog here: 
http://brianoneill.blogspot.com/2014/03/shark-on-cassandra-w-cash-interrogating.html
and http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html
3. PrestoDB http://prestodb.io/ 

Thanks
Bobby


On May 30, 2014, at 12:09 PM, cbert...@libero.it cbert...@libero.it wrote:

Hello,
I have a working cluster of Cassandra that performs very well on a high 
traffic web application. 
Now I need to build a backend web application to query Cassandra on many non 
indexed columns ... what is the best way to do that? Apache hive? Pig?


Cassandra 2 


Thanks


Re: Managing truststores with inter-node encryption

2014-05-30 Thread Ben Bromhead
Java SSL sockets need to be able to build a chain of trust, so having
either a node's public cert or the root cert in the truststore works (as you
found out).

To get Cassandra to use cipher suites > 128 bit you will need to install
the JCE unlimited strength jurisdiction policy files. You will know if you
aren't using them because there will be a bunch of warnings quickly filling
up your logs.

Note that Java's SSL implementation does not check certificate revocation
lists by default, though as you are not using inter-node encryption for
authentication and identification it's no big deal.

Ben
 On 31/05/2014 1:04 AM, Jeremy Jongsma jer...@barchart.com wrote:

 It appears that only adding the CA certificate to the truststore is
 sufficient for this.


 On Thu, May 22, 2014 at 10:05 AM, Jeremy Jongsma jer...@barchart.com
 wrote:

 The docs say that each node needs every other node's certificate in its
 local truststore:


 http://www.datastax.com/documentation/cassandra/1.2/cassandra/security/secureSSLCertificates_t.html

 This seems like a bit of a headache for adding nodes to a cluster. How do
 others deal with this?

 1) If I am self-signing the client certificates (with puppetmaster), is
 it enough that the truststore just contain the CA certificate used to sign
 them? This is the typical PKI mechanism for verifying trust, so I am hoping
 it works here.

 2) If not, can I use the same certificate for every node? If so, what is
 the downside? I'm mainly concerned with encryption over public internet
 links, not node identity verification.