Re: Less frequent flushing with LCS

2015-03-02 Thread Dan Kinder
Nope, they flush every 5 to 10 minutes.

On Mon, Mar 2, 2015 at 1:13 PM, Daniel Chia danc...@coursera.org wrote:

 Do the tables look like they're being flushed every hour? It seems like
 the setting memtable_flush_after_mins which I believe defaults to 60
 could also affect how often your tables are flushed.

 Thanks,
 Daniel

 On Mon, Mar 2, 2015 at 11:49 AM, Dan Kinder dkin...@turnitin.com wrote:

 I see, thanks for the input. Compression is not enabled at the moment,
 but I may try increasing that number regardless.

 Also I don't think in-memory tables would work since the dataset is
 actually quite large. The pattern is more like a given set of rows will
 receive many overwriting updates and then not be touched for a while.

 On Fri, Feb 27, 2015 at 2:27 PM, Robert Coli rc...@eventbrite.com
 wrote:

 On Fri, Feb 27, 2015 at 2:01 PM, Dan Kinder dkin...@turnitin.com
 wrote:

 Theoretically sstable_size_in_mb could be causing it to flush (it's at
 the default 160MB)... though we are flushing well before we hit 160MB. I
 have not tried changing this but we don't necessarily want all the sstables
 to be large anyway,


 I've always wished that the log message told you *why* the SSTable was
 being flushed, which of the various bounds prompted the flush.

 In your case, the size on disk may be under 160MB because compression is
 enabled. I would start by increasing that size.

 Datastax DSE has in-memory tables for this use case.

 =Rob
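
For reference, the LCS SSTable target Rob refers to can be raised per table; a sketch (keyspace/table names and the value 256 are illustrative, not a recommendation):

```sql
ALTER TABLE ks.tbl WITH compaction =
  {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 256};
```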




 --
 Dan Kinder
 Senior Software Engineer
 Turnitin – www.turnitin.com
 dkin...@turnitin.com





-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


RDD partitions per executor in Cassandra Spark Connector

2015-03-02 Thread Rumph, Frens Jan
Hi all,

I didn't find the *issues* button on
https://github.com/datastax/spark-cassandra-connector/ so posting here.

Anyone have an idea why token ranges are grouped into one partition per
executor? I expected at least one per core. Any suggestions on how to work
around this? Doing a repartition is way too expensive as I just want more
partitions for parallelism, not a reshuffle ...

Thanks in advance!
Frens Jan


Re: Should a node that is bootstrapping be receiving writes in addition to the streams it is receiving?

2015-03-02 Thread Robert Coli
On Mon, Mar 2, 2015 at 1:58 PM, Paulo Ricardo Motta Gomes 
paulo.mo...@chaordicsystems.com wrote:

 I also checked via JMX and all the write counts are zero. Is the node
 supposed to receive writes during bootstrap?


As I understand it, yes.

The other funny thing during bootstrap is that nodetool status shows that
 the bootstrapping node is Up/Normal (UN) instead of Up/Joining (UJ). Is this
 expected or is it a bug? The bootstrapping node does not even appear in the
 nodetool status of other nodes.


Perhaps this node is not actually bootstrapping because you have configured
it as a seed with no other valid seeds listed and so it has started as a
cluster of one?

=Rob
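
If that is indeed the cause, the fix is to point the joining node at the existing seeds in cassandra.yaml; a sketch (addresses are illustrative):

```yaml
# cassandra.yaml on the joining node -- addresses are illustrative.
# A node that lists only itself as a seed will skip bootstrap and
# start as a cluster of one.
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.1,10.0.0.2"
```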


Re: Less frequent flushing with LCS

2015-03-02 Thread Daniel Chia
Do the tables look like they're being flushed every hour? It seems like the
setting memtable_flush_after_mins which I believe defaults to 60 could also
affect how often your tables are flushed.

Thanks,
Daniel

On Mon, Mar 2, 2015 at 11:49 AM, Dan Kinder dkin...@turnitin.com wrote:

 I see, thanks for the input. Compression is not enabled at the moment, but
 I may try increasing that number regardless.

 Also I don't think in-memory tables would work since the dataset is
 actually quite large. The pattern is more like a given set of rows will
 receive many overwriting updates and then not be touched for a while.

 On Fri, Feb 27, 2015 at 2:27 PM, Robert Coli rc...@eventbrite.com wrote:

 On Fri, Feb 27, 2015 at 2:01 PM, Dan Kinder dkin...@turnitin.com wrote:

 Theoretically sstable_size_in_mb could be causing it to flush (it's at
 the default 160MB)... though we are flushing well before we hit 160MB. I
 have not tried changing this but we don't necessarily want all the sstables
 to be large anyway,


 I've always wished that the log message told you *why* the SSTable was
 being flushed, which of the various bounds prompted the flush.

 In your case, the size on disk may be under 160MB because compression is
 enabled. I would start by increasing that size.

 Datastax DSE has in-memory tables for this use case.

 =Rob




 --
 Dan Kinder
 Senior Software Engineer
 Turnitin – www.turnitin.com
 dkin...@turnitin.com



Re: Should a node that is bootstrapping be receiving writes in addition to the streams it is receiving?

2015-03-02 Thread Paulo Ricardo Motta Gomes
I'm also facing a similar issue while bootstrapping a replacement node via
-Dreplace_address flag. The node is streaming data from neighbors, but
cfstats shows 0 counts for all metrics of all CFs in the bootstrapping node:

SSTable count: 0
SSTables in each level: [0, 0, 0, 0, 0, 0, 0, 0, 0]
Space used (live), bytes: 0
Space used (total), bytes: 0
SSTable Compression Ratio: 0.0
Number of keys (estimate): 0
Memtable cell count: 0
Memtable data size, bytes: 0
Memtable switch count: 0
Local read count: 0
Local read latency: 0.000 ms
Local write count: 0
Local write latency: 0.000 ms
Pending tasks: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.0
Bloom filter space used, bytes: 0
Compacted partition minimum bytes: 0
Compacted partition maximum bytes: 0
Compacted partition mean bytes: 0
Average live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0

I also checked via JMX and all the write counts are zero. Is the node
supposed to receive writes during bootstrap?

The other funny thing during bootstrap is that nodetool status shows that
the bootstrapping node is Up/Normal (UN) instead of Up/Joining (UJ). Is this
expected or is it a bug? The bootstrapping node does not even appear in the
nodetool status of other nodes.

UN  X.Y.Z.244  15.9 GB1   3.7%
52fb21e-4621-4533-b201-8c1a7adbe818  rack

If I do a nodetool netstats, I see:

Mode: JOINING
Bootstrap 647d4b30-c11e-11e4-9249-173e73521fb44

Cheers,

Paulo

On Thu, Oct 16, 2014 at 3:53 PM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Oct 15, 2014 at 10:07 PM, Peter Haggerty 
 peter.hagge...@librato.com wrote:

 The node wrote gigs of data to various CFs during the bootstrap so it
 was clearly writing in some sense and it has the expected behavior
 after the bootstrap. Is cfstats correct when it reports that there
 were no writes during a bootstrap?


 As I understand it :

 Writes (extra writes, from the perspective of replication factor, f/e a
 RF=3 cluster has effective RF=4 during bootstrap, but not relevant for
 consistency purposes until end of bootstrap) occur via the storage protocol
 during bootstrap, so I would expect to see those reflected in cfstats.

 I'm relatively confident it is in fact receiving those writes, so your
 confusion might just be a result of how it's reported?

 =Rob
 http://twitter.com/rcolidba




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Reboot: Read After Write Inconsistent Even On A One Node Cluster

2015-03-02 Thread Robert Coli
On Mon, Mar 2, 2015 at 11:44 AM, Dan Kinder dkin...@turnitin.com wrote:

 I had been having the same problem as in those older post:
 http://mail-archives.apache.org/mod_mbox/cassandra-user/201411.mbox/%3CCAORswtz+W4Eg2CoYdnEcYYxp9dARWsotaCkyvS5M7+Uo6HT1=a...@mail.gmail.com%3E


As I said on that thread :

It sounds unreasonable/unexpected to me, if you have a trivial repro case,
I would file a JIRA.

=Rob


Re: Reboot: Read After Write Inconsistent Even On A One Node Cluster

2015-03-02 Thread Dan Kinder
Yeah I thought that was suspicious too, it's mysterious and fairly
consistent. (By the way I had error checking but removed it for email
brevity, but thanks for verifying :) )

On Mon, Mar 2, 2015 at 4:13 PM, Peter Sanford psanf...@retailnext.net
wrote:

 Hmm. I was able to reproduce the behavior with your go program on my dev
 machine (C* 2.0.12). I was hoping it was going to just be an unchecked
 error from the .Exec() or .Scan(), but that is not the case for me.

 The fact that the issue seems to happen on loop iteration 10, 100 and 1000
 is pretty suspicious. I took a tcpdump to confirm that the gocql was in
 fact sending the write 100 query and then on the next read Cassandra
 responded with 99.

 I'll be interested to see what the result of the jira ticket is.

 -psanford




-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: Reboot: Read After Write Inconsistent Even On A One Node Cluster

2015-03-02 Thread Peter Sanford
Hmm. I was able to reproduce the behavior with your go program on my dev
machine (C* 2.0.12). I was hoping it was going to just be an unchecked
error from the .Exec() or .Scan(), but that is not the case for me.

The fact that the issue seems to happen on loop iteration 10, 100 and 1000
is pretty suspicious. I took a tcpdump to confirm that the gocql was in
fact sending the write 100 query and then on the next read Cassandra
responded with 99.

I'll be interested to see what the result of the jira ticket is.

-psanford


Re: Reboot: Read After Write Inconsistent Even On A One Node Cluster

2015-03-02 Thread Dan Kinder
Done: https://issues.apache.org/jira/browse/CASSANDRA-8892

On Mon, Mar 2, 2015 at 3:26 PM, Robert Coli rc...@eventbrite.com wrote:

 On Mon, Mar 2, 2015 at 11:44 AM, Dan Kinder dkin...@turnitin.com wrote:

 I had been having the same problem as in those older post:
 http://mail-archives.apache.org/mod_mbox/cassandra-user/201411.mbox/%3CCAORswtz+W4Eg2CoYdnEcYYxp9dARWsotaCkyvS5M7+Uo6HT1=a...@mail.gmail.com%3E


 As I said on that thread :

 It sounds unreasonable/unexpected to me, if you have a trivial repro
 case, I would file a JIRA.

 =Rob




-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: Node stuck in joining the ring

2015-03-02 Thread Phil Yang
I encountered a similar situation where streaming could not finish, not only
when joining but also when removing a node. My (admittedly tricky) workaround
is to restart every node in the cluster before starting the new node. In my
experience the streaming stall only shows up on nodes that have been running
for many days, though I have no idea why.

2015-03-03 2:42 GMT+08:00 Nate McCall n...@thelastpickle.com:

 Can you verify that cassandra-rackdc.properties and
 cassandra-topology.properties are the same across the cluster?

 On Thu, Feb 26, 2015 at 7:52 AM, Batranut Bogdan batra...@yahoo.com
 wrote:

 No errors in the system.log file
 [root@cassa09 cassandra]# grep ERROR system.log
 [root@cassa09 cassandra]#

 Nothing.


   On Thursday, February 26, 2015 1:55 PM, mck m...@apache.org wrote:


 Any errors in your log file?

 We saw something similar when bootstrap crashed when rebuilding
 secondary indexes.

 See CASSANDRA-8798

 ~mck






 --
 -
 Nate McCall
 Austin, TX
 @zznate

 Co-Founder & Sr. Technical Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com




-- 
Thanks,
Phil Yang


Re: using or in select query in cassandra

2015-03-02 Thread Jens Rantil
Hi Rahul,

No, you can't do this in a single query. You will need to execute two
separate queries if the requirements are on different columns. However, if
you'd like to select multiple rows with a restriction on the same column,
you can do that using the `IN` construct:

select * from table where id IN (123,124);

See [1] for reference.

[1]
http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html

Cheers,
Jens

On Mon, Mar 2, 2015 at 7:06 AM, Rahul Srivastava 
srivastava.robi...@gmail.com wrote:

 Hi
  I want to enforce uniqueness for my data, so I need to add an OR clause in my
 WHERE clause.
 ex: select * from table where id = 123 OR name = 'abc'
 so in the above I want to get data if my id is 123 or my name is abc.

 Is there any possibility in Cassandra to achieve this?




-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se

Facebook https://www.facebook.com/#!/tink.se Linkedin
http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_phototrkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary
 Twitter https://twitter.com/tink


Re: how to make unique coloumns in cassandra

2015-03-02 Thread Peter Lin
Use an RDBMS.

There is a reason constraints were created, and a reason Cassandra doesn't have them.

Sent from my iPhone

 On Mar 2, 2015, at 2:23 AM, Rahul Srivastava srivastava.robi...@gmail.com 
 wrote:
 
 but what if I want to fetch the value using one table? Then this idea might fail.
 
 On Mon, Mar 2, 2015 at 12:46 PM, Ajaya Agrawal ajku@gmail.com wrote:
 Make a table for each of the unique keys. For example:
 
 If primary key for user table is user_id and you want the phone number 
 column to be unique then create another table wherein the primary key is 
 (phone_number, user_id). Before inserting to main table try to insert to 
 this table first with if not exists clause. If it succeeds then go ahead 
 with your insert to the user table. Similarly while deleting a row from the 
 primary table delete the corresponding row in all other tables. The order of 
 insertion to tables matter here otherwise you would end up inducing race 
 conditions.
 
 The catch here is, you should not be updating the unique column ever. If you 
 do that you would have to use locks and if there are multiple nodes running 
 your application then you would need a distributed lock. I would suggest not 
 to update the unique columns. Instead, force your users to delete the entry 
 and recreate it. If you can't do that you need to evaluate your choice of 
 database. Perhaps a relational database would be better suited to your 
 requirements.
 
 Hope this helps!
 
 -Ajaya
 
 Cheers,
 Ajaya
 
 On Fri, Feb 27, 2015 at 5:26 PM, ROBIN SRIVASTAVA 
 srivastava.robi...@gmail.com wrote:
  I want to enforce a unique constraint in Cassandra: I want all the
  values in my column family to be unique, e.g. name-rahul phone-123
  address-abc.
  
  Now I want that no row with values equal to rahul, 123, or abc gets
  inserted again. Searching on DataStax I found that I can achieve this by
  querying on the partition key with IF NOT EXISTS, but I have not found a
  solution for keeping all three values unique. That means if name-jacob
  phone-123 address-qwe arrives,
  
  it should also not be inserted into my database, as its phone column has
  the same value as the row with name-rahul.
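
A hedged sketch of Ajaya's two-table pattern in CQL (table and column names are illustrative). One note: for the IF NOT EXISTS check to guard the phone number alone, the lookup table's primary key should be just the phone number, with the owner as a regular column; if user_id were part of the primary key, two different users could both "successfully" claim the same number:

```sql
CREATE TABLE users (user_id text PRIMARY KEY, phone text, address text);
CREATE TABLE phone_owner (phone text PRIMARY KEY, user_id text);

-- Claim the phone first; proceed only if the result shows [applied] = True.
INSERT INTO phone_owner (phone, user_id) VALUES ('123', 'rahul') IF NOT EXISTS;
-- Only then write the main row.
INSERT INTO users (user_id, phone, address) VALUES ('rahul', '123', 'abc');
```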
 
 


Re: How to extract all the user id from a single table in Cassandra?

2015-03-02 Thread Jens Rantil
Hi Check,

Please avoid double posting on mailing lists. It leads to double work
(respect people's time!) and makes it hard for people in the future having
the same issue as you to follow discussions and answers.

That said, if you have a lot of primary keys

select user_id from testkeyspace.user_record;

will most definitely time out. Have a look at `SELECT DISTINCT` at [1]. More
importantly, for larger datasets you will also need to split the token
space into smaller segments and iteratively select your primary keys. See
[2].

[1]
http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html
[2]
http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html?scroll=reference_ds_d35_v2q_xj__paging-through-unordered-results

If you are having specific issues with the Java Driver I suggest you ask on
that mailing list (only).

Cheers,
Jens
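
The token-range splitting Jens mentions boils down to carving the full Murmur3 ring into contiguous segments and issuing one bounded query per segment. A self-contained sketch of the arithmetic (the query in the comment is what each segment would feed; table names follow the thread):

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class TokenSplitter {
    // Murmur3Partitioner token range: [-2^63, 2^63 - 1]
    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

    // Split the full token ring into n contiguous [start, end] segments;
    // the last segment absorbs any remainder.
    static List<long[]> split(int n) {
        BigInteger width = MAX.subtract(MIN).add(BigInteger.ONE)
                              .divide(BigInteger.valueOf(n));
        List<long[]> segments = new ArrayList<>();
        BigInteger start = MIN;
        for (int i = 0; i < n; i++) {
            BigInteger end = (i == n - 1) ? MAX : start.add(width).subtract(BigInteger.ONE);
            segments.add(new long[] { start.longValueExact(), end.longValueExact() });
            start = end.add(BigInteger.ONE);
        }
        return segments;
    }

    public static void main(String[] args) {
        for (long[] seg : split(4)) {
            // Each segment maps to one query, e.g.:
            // SELECT DISTINCT user_id FROM testkeyspace.user_record
            //   WHERE token(user_id) >= ? AND token(user_id) <= ?
            System.out.println(seg[0] + " .. " + seg[1]);
        }
    }
}
```

Running each segment's query sequentially (or with bounded concurrency) keeps any single request small enough not to time out.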

On Sun, Mar 1, 2015 at 6:38 PM, Check Peck comptechge...@gmail.com wrote:

 Sending again as I didn't get any response on this.

 Any thoughts?

 On Fri, Feb 27, 2015 at 8:24 PM, Check Peck comptechge...@gmail.com
 wrote:

 I have a Cassandra table like this -

 create table user_record (user_id text, record_name text,
 record_value blob, primary key (user_id, record_name));

 What is the best way to extract all the user_id from this table? As of
 now, I cannot change my data model to do this exercise so I need to find a
 way by which I can extract all the user_id from the above table.

 I am using the Datastax Java driver in my project. Is there any other easy
 way, apart from code, to extract all the user_id from the above table through
 some cqlsh utility and dump it into some file?

 I am thinking the below code might time out after some time -

 public class TestCassandra {

     private Session session = null;
     private Cluster cluster = null;

     private static class ConnectionHolder {
         static final TestCassandra connection = new TestCassandra();
     }

     public static TestCassandra getInstance() {
         return ConnectionHolder.connection;
     }

     private TestCassandra() {
         Builder builder = Cluster.builder();
         builder.addContactPoints("127.0.0.1");

         PoolingOptions opts = new PoolingOptions();
         opts.setCoreConnectionsPerHost(HostDistance.LOCAL,
                 opts.getCoreConnectionsPerHost(HostDistance.LOCAL));

         cluster = builder.withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
                 .withPoolingOptions(opts)
                 .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy("PI")))
                 .withReconnectionPolicy(new ConstantReconnectionPolicy(100L))
                 .build();
         session = cluster.connect();
     }

     private Set<String> getRandomUsers() {
         Set<String> userList = new HashSet<String>();

         String sql = "select user_id from testkeyspace.user_record;";

         try {
             SimpleStatement query = new SimpleStatement(sql);
             query.setConsistencyLevel(ConsistencyLevel.ONE);
             ResultSet res = session.execute(query);

             Iterator<Row> rows = res.iterator();
             while (rows.hasNext()) {
                 Row r = rows.next();

                 String user_id = r.getString("user_id");
                 userList.add(user_id);
             }
         } catch (Exception e) {
             System.out.println("error= " + e);
         }

         return userList;
     }
 }

 Adding java-driver group and Cassandra group as well to see whether there
 is any better way to execute this?





-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se

Facebook https://www.facebook.com/#!/tink.se Linkedin
http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_phototrkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary
 Twitter https://twitter.com/tink


Re: Composite Keys in cassandra 1.2

2015-03-02 Thread Kai Wang
AFAIK it's not possible. The fact that you need to query the data by partial
row key indicates your data model isn't right. What are your typical queries
on the data?

On Sun, Mar 1, 2015 at 7:24 AM, Yulian Oifa oifa.yul...@gmail.com wrote:

 Hello to all.
 Lets assume a scenario where key is compound type with 3 types in it (
 Long , UTF8, UTF8 ).
 Each row stores timeuuids as column names and empty values.
 Is it possible to retrieve data by a single key part (for example by the
 long only) by using Java Thrift?

 Best regards
 Yulian Oifa





Re: Optimal Batch size (Unlogged) for Java driver

2015-03-02 Thread Ajay
I have a column family with 15 columns: a timestamp, a timeuuid, a few text
fields, and the rest int fields. If I calculate the size of each column name
and its value, and divide 5kb (the recommended max size for a batch) by that
value, I get 12 as the result. Is that correct? Am I missing something?

Thanks
Ajay
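
A back-of-the-envelope sketch of the size arithmetic being discussed. The column names, types, and 5kb figure are taken from the thread; the per-type byte counts are rough serialized sizes, not exact native-protocol framing, so treat the result as a starting point to tune at runtime as Ankush suggests:

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class RowSizeEstimator {
    // Rough serialized size of a single value.
    static int sizeOf(Object v) {
        if (v instanceof Integer) return 4;
        if (v instanceof Long) return 8;            // timestamps
        if (v instanceof java.util.UUID) return 16; // timeuuid
        if (v instanceof String) return ((String) v).getBytes(StandardCharsets.UTF_8).length;
        throw new IllegalArgumentException("unhandled type: " + v.getClass());
    }

    // Sum of column-name bytes plus value bytes for one row.
    static int estimate(Map<String, Object> row) {
        int total = 0;
        for (Map.Entry<String, Object> e : row.entrySet()) {
            total += e.getKey().getBytes(StandardCharsets.UTF_8).length + sizeOf(e.getValue());
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("event_time", 1425312000000L);     // timestamp
        row.put("id", java.util.UUID.randomUUID()); // timeuuid
        row.put("count", 42);                       // int
        row.put("payload", "some text value");      // text
        // Chunk size for an unlogged batch = 5kb / estimated row size,
        // as discussed in the thread.
        System.out.println(5 * 1024 / estimate(row));
    }
}
```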
On 02-Mar-2015 12:13 pm, Ankush Goyal ank...@gmail.com wrote:

 Hi Ajay,

 I would suggest, looking at the approximate size of individual elements in
 the batch, and based on that compute max size (chunk size).

 It's not really a straightforward calculation, so I would further suggest
 making that chunk size a runtime parameter that you can tweak and play
 around with until you reach a stable state.

 On Sunday, March 1, 2015 at 10:06:55 PM UTC-8, Ajay Garga wrote:

 Hi,

 I am looking at a way to compute the optimal batch size in the client
 side similar to the below mentioned bug in the server side (generic as we
 are exposing REST APIs for Cassandra, the column family and the data are
 different each request).

 https://issues.apache.org/jira/browse/CASSANDRA-6487
 https://www.google.com/url?q=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FCASSANDRA-6487sa=Dsntz=1usg=AFQjCNGOSliZnS1idXqTHXIr7aNfEN3mMg

 How do we compute (approximately, using ColumnDefinitions or ColumnMetadata)
 the size of a row of a column family from the client side using the Cassandra
 Java driver?

 Thanks
 Ajay

  To unsubscribe from this group and stop receiving emails from it, send an
 email to java-driver-user+unsubscr...@lists.datastax.com.



Re: Optimal Batch size (Unlogged) for Java driver

2015-03-02 Thread Ajay
Hi Ankush,

We are already using Prepared statement and our case is a time series data
as well.

Thanks
Ajay
On 02-Mar-2015 10:00 pm, Ankush Goyal ank...@gmail.com wrote:

 Ajay,

 First of all, I would recommend using PreparedStatements, so you only
 would be sending the variable bound arguments over the wire. Second, I
 think that 5kb limit for WARN is too restrictive, and you could tune that
 on cassandra server side. I think if all you have is 15 columns (as long as
 their values are sanitized and do not go over certain limits), it should be
 fine to send all of them over at the same time. Chunking is necessary, when
 you have time-series type data (for writes) OR you might be reading a lot
 of data via IN query.

 On Monday, March 2, 2015 at 7:55:18 AM UTC-8, Ajay Garga wrote:

 I have a column family with 15 columns: a timestamp, a timeuuid, a few text
 fields, and the rest int fields. If I calculate the size of each column name
 and its value, and divide 5kb (the recommended max size for a batch) by that
 value, I get 12 as the result. Is that correct? Am I missing something?

 Thanks
 Ajay
 On 02-Mar-2015 12:13 pm, Ankush Goyal ank...@gmail.com wrote:

 Hi Ajay,

 I would suggest, looking at the approximate size of individual elements
 in the batch, and based on that compute max size (chunk size).

 It's not really a straightforward calculation, so I would further suggest
 making that chunk size a runtime parameter that you can tweak and play
 around with until you reach a stable state.

 On Sunday, March 1, 2015 at 10:06:55 PM UTC-8, Ajay Garga wrote:

 Hi,

 I am looking at a way to compute the optimal batch size in the client
 side similar to the below mentioned bug in the server side (generic as we
 are exposing REST APIs for Cassandra, the column family and the data are
 different each request).

 https://issues.apache.org/jira/browse/CASSANDRA-6487
 https://www.google.com/url?q=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FCASSANDRA-6487sa=Dsntz=1usg=AFQjCNGOSliZnS1idXqTHXIr7aNfEN3mMg

 How do we compute (approximately, using ColumnDefinitions or
 ColumnMetadata) the size of a row of a column family from the client side
 using the Cassandra Java driver?

 Thanks
 Ajay

  To unsubscribe from this group and stop receiving emails from it, send
 an email to java-driver-us...@lists.datastax.com.

  To unsubscribe from this group and stop receiving emails from it, send
 an email to java-driver-user+unsubscr...@lists.datastax.com.



Re: using or in select query in cassandra

2015-03-02 Thread Jonathan Haddad
I'd like to add that in() is usually a bad idea.  It is convenient, but not
really what you want in production.  Go with Jens' original suggestion of
multiple queries.

I recommend reading Ryan Svihla's post on why in() is generally a bad
thing:
http://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/

On Mon, Mar 2, 2015 at 12:36 AM Jens Rantil jens.ran...@tink.se wrote:

 Hi Rahul,

 No, you can't do this in a single query. You will need to execute two
 separate queries if the requirements are on different columns. However, if
 you'd like to select multiple rows of with restriction on the same column
 you can do that using the `IN` construct:

 select * from table where id IN (123,124);

 See [1] for reference.

 [1]
 http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html

 Cheers,
 Jens

 On Mon, Mar 2, 2015 at 7:06 AM, Rahul Srivastava 
 srivastava.robi...@gmail.com wrote:

 Hi
  I want to make uniqueness for my data so i need to add OR clause  in my
 WHERE clause.
 ex: select * from table where id =123 OR name ='abc'
 so in above i want that i get data if my id is 123 or my name is abc .

 is there any possibility in cassandra to achieve this .




 --
 Jens Rantil
 Backend engineer
 Tink AB

 Email: jens.ran...@tink.se
 Phone: +46 708 84 18 32
 Web: www.tink.se

 Facebook https://www.facebook.com/#!/tink.se Linkedin
 http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_phototrkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary
  Twitter https://twitter.com/tink



Datastax Agent 5.1+ Configuration

2015-03-02 Thread Robert Halstead
I recently attempted to get our cassandra instances talking securely to one
another with ssl opscenter communication.  We are using DSE 4.6, opscenter
5.1.  While a lot of the datastax documentation is fairly good, when it
comes to advanced configuration topics or security configuration, I find
the docs very lacking.

I set up a 3-node cluster with SSL encryption between nodes and
password authentication turned on. Obviously, you then need to set up the
user/pass in the agent configuration as well. These used to be
thrift_user and thrift_pass (or something along those lines) and the SSL
options were thrift_keystore / thrift_truststore, etc.

In OpsCenter 5.1, the system changed from using Thrift to the native
interface. However, there is nothing in the docs about which agent
properties you need to set for SSL security and authentication.

After my dealings with Datastax Support, I thought I would post this here
until they update their documentation.

Agent configuration (address.yaml)

C* connection options

*IP addresses

Before 5.1, we were using either thrift_rpc_interface (when storing
metrics/settings in the same cluster) or storage_thrift_hosts
(separate cluster) to determine what IP to use to connect to C*. In
5.1, both options were replaced with hosts, that accepts an array of
strings (including an array w/ a single string for the same cluster
case) instead of a single string:

hosts: ["123.234.111.11", "10.1.1.1"]

C* port
storage_thrift_port was removed, thrift_port was supplemented by cassandra_port

C* autodiscovery
autodiscovery_enabled, autodiscovery_interval, and storage_dc were
removed, autodiscovery can’t really be disabled for our java-driver,
but we never connect to hosts that are not specified in the agent’s
config.

Misc
thrift_socket_timeout and thrift_conn_timeout were removed.

C*/DSE security
PLAINTEXT AUTH
thrift_user, storage_thrift_user, thrift_pass, and storage_thrift_pass
were replaced by cassandra_user and cassandra_pass

ENCRYPTION
thrift_ssl_truststore and thrift_ssl_truststore_password were replaced
by ssl_keystore and ssl_keystore_password, respectively.
thrift_ssl_truststore_type, thrift_max_frame_size were removed.

KERBEROS
We completely changed the way we setup kerberos (I thought it was
doc’d but apparently it wasn’t). We removed everything
kerberos-related from the config except for a single option,
kerberos_service. When it’s set (to the Kerberos service name) we’re
using kerberos. All the configuration takes place in the
kerberos.config file.
opscenterd cluster configs

[cassandra]
send_thrift_rpc was renamed to be thrift_rpc

[agents]
thrift_ssl_truststore and thrift_ssl_truststore_password were renamed
to ssl_keystore and ssl_keystore_password, respectively.
thrift_ssl_truststore_type was removed.
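
Putting the agent-side options above together, a minimal secured address.yaml might look like this (every value is illustrative; option names are the ones listed above):

```yaml
# address.yaml -- OpsCenter 5.1 agent; all values illustrative
hosts: ["10.1.1.1", "10.1.1.2"]
cassandra_port: 9042
cassandra_user: "opscenter_agent"
cassandra_pass: "secret"
ssl_keystore: "/etc/datastax-agent/keystore"
ssl_keystore_password: "changeit"
```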

Hopefully this will be helpful for those running the latest opscenter and
want a secure setup.

Thanks to datastax for the help in this matter.


Re: using or in select query in cassandra

2015-03-02 Thread Robert Wille
I would also like to add that if you avoid IN and use async queries instead, it 
is pretty trivial to use a semaphore or some other limiting mechanism to put a 
ceiling on the amount of concurrent work you are sending to the cluster. If you 
use a query with an IN clause with a thousand things, you'll make the cluster 
look for a thousand records concurrently. If you issue a thousand async queries 
and use a limiting mechanism, then you can control how much load you are 
placing on the server.

I built a nice wrapper around the Session object, and one of the things that is 
built into the wrapper is the ability to limit the number of concurrent async 
queries. It’s a really nice and simple feature to have.

Robert
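
A minimal sketch of Robert's "cap the concurrent async queries" idea using a Semaphore. The driver's ResultSetFuture is stood in for by CompletableFuture so the sketch is self-contained; against a real cluster the supplier body would be session.executeAsync(query):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class ThrottledExecutor {
    private final Semaphore permits;
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    public ThrottledExecutor(int maxInFlight) {
        permits = new Semaphore(maxInFlight);
    }

    // Block until a permit is free, run the "query", release on completion.
    public CompletableFuture<String> executeAsync(String query) throws InterruptedException {
        permits.acquire();
        CompletableFuture<String> f = CompletableFuture.supplyAsync(() -> {
            // With the real driver this would be session.executeAsync(query).
            return "result of " + query;
        }, pool);
        f.whenComplete((r, t) -> permits.release());
        return f;
    }

    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        ThrottledExecutor ex = new ThrottledExecutor(32); // at most 32 in flight
        for (int i = 0; i < 1000; i++) {
            ex.executeAsync("select * from t where id = " + i);
        }
        ex.shutdown();
        System.out.println("done");
    }
}
```

The semaphore makes back-pressure explicit: the submitting thread stalls once maxInFlight queries are outstanding, instead of flooding the cluster the way a thousand-element IN clause would.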

On Mar 2, 2015, at 10:33 AM, Jonathan Haddad 
j...@jonhaddad.commailto:j...@jonhaddad.com wrote:

I'd like to add that in() is usually a bad idea.  It is convenient, but not 
really what you want in production.  Go with Jens' original suggestion of 
multiple queries.

I recommend reading Ryan Svihla's post on why in() is generally a bad thing: 
http://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/

On Mon, Mar 2, 2015 at 12:36 AM Jens Rantil 
jens.ran...@tink.semailto:jens.ran...@tink.se wrote:
Hi Rahul,

No, you can't do this in a single query. You will need to execute two separate 
queries if the requirements are on different columns. However, if you'd like to 
select multiple rows of with restriction on the same column you can do that 
using the `IN` construct:

select * from table where id IN (123,124);

See [1] for reference.

[1] 
http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html

Cheers,
Jens

On Mon, Mar 2, 2015 at 7:06 AM, Rahul Srivastava 
srivastava.robi...@gmail.commailto:srivastava.robi...@gmail.com wrote:
Hi
 I want to make uniqueness for my data so i need to add OR clause  in my WHERE 
clause.
ex: select * from table where id =123 OR name ='abc'
so in above i want that i get data if my id is 123 or my name is abc .

is there any possibility in cassandra to achieve this .




--
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.semailto:jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.sehttp://www.tink.se/

Facebookhttps://www.facebook.com/#!/tink.se 
Linkedinhttp://www.linkedin.com/company/2735919?trk=vsrp_companies_res_phototrkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary
 Twitterhttps://twitter.com/tink



Re: Running Cassandra on mixed OS

2015-03-02 Thread Jonathan Haddad
I would really not recommend this.  There are enough issues that can come up
with a distributed database that make it hard to pinpoint problems.

In an ideal world, every machine would be completely identical.  Don't set
yourself up to fail.  Pin the OS & all packages to specific versions.

On Mon, Mar 2, 2015 at 6:44 AM sean_r_dur...@homedepot.com wrote:

  Cassandra 1.2.13+/2.0.12



 Have any of you run a single Cassandra cluster on a mix of OS (Red Hat 5
 and 6, for example), but with the same JVM? Any issues or concerns? If
 there are problems, how do you handle OS upgrades?







 Sean R. Durity

 --

 The information in this Internet Email is confidential and may be legally
 privileged. It is intended solely for the addressee. Access to this Email
 by anyone else is unauthorized. If you are not the intended recipient, any
 disclosure, copying, distribution or any action taken or omitted to be
 taken in reliance on it, is prohibited and may be unlawful. When addressed
 to our clients any opinions or advice contained in this Email are subject
 to the terms and conditions expressed in any applicable governing The Home
 Depot terms of business or client engagement letter. The Home Depot
 disclaims all responsibility and liability for the accuracy and content of
 this attachment and for any damages or losses arising from any
 inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other
 items of a destructive nature, which may be contained in this attachment
 and shall not be liable for direct, indirect, consequential or special
 damages in connection with this e-mail message or its attachment.



RE: Running Cassandra on mixed OS

2015-03-02 Thread SEAN_R_DURITY
This is not for the long haul, but in order to accomplish an OS upgrade across 
the cluster, without taking an outage.

Sean Durity

From: Jonathan Haddad [mailto:j...@jonhaddad.com]
Sent: Monday, March 02, 2015 1:15 PM
To: user@cassandra.apache.org
Subject: Re: Running Cassandra on mixed OS

I would really not recommend this.  There are enough issues that can come up with 
a distributed database to make it hard to pinpoint problems.

In an ideal world, every machine would be completely identical.  Don't set 
yourself up for failure.  Pin the OS & all packages to specific versions.

On Mon, Mar 2, 2015 at 6:44 AM 
sean_r_dur...@homedepot.com wrote:
Cassandra 1.2.13+/2.0.12

Have any of you run a single Cassandra cluster on a mix of OS (Red Hat 5 and 6, 
for example), but with the same JVM? Any issues or concerns? If there are 
problems, how do you handle OS upgrades?



Sean R. Durity








Re: Running Cassandra on mixed OS

2015-03-02 Thread Robert Coli
On Mon, Mar 2, 2015 at 6:43 AM, sean_r_dur...@homedepot.com wrote:

  Have any of you run a single Cassandra cluster on a mix of OS (Red Hat 5
 and 6, for example), but with the same JVM? Any issues or concerns? If
 there are problems, how do you handle OS upgrades?


If you are running the same version of Cassandra in both cases, you are
probably fine. As you point out, one must inevitably upgrade one's OS; I
recently went from Ubuntu 10.04 to 12.04 (with the associated 1.6/1.7 JVMs)
without any problems.

But you should of course do any such activity in QA and staging and let it
burn in for a while before doing so in prod.

=Rob


Re: Node stuck in joining the ring

2015-03-02 Thread Nate McCall
Can you verify that cassandra-rackdc.properties and
cassandra-topology.properties are the same on the cluster?

On Thu, Feb 26, 2015 at 7:52 AM, Batranut Bogdan batra...@yahoo.com wrote:

 No errors in the system.log file
 [root@cassa09 cassandra]# grep ERROR system.log
 [root@cassa09 cassandra]#

 Nothing.


   On Thursday, February 26, 2015 1:55 PM, mck m...@apache.org wrote:


 Any errors in your log file?

 We saw something similar when bootstrap crashed when rebuilding
 secondary indexes.

 See CASSANDRA-8798

 ~mck






-- 
-
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: sstables remain after compaction

2015-03-02 Thread Robert Coli
On Sat, Feb 28, 2015 at 5:39 PM, Jason Wee peich...@gmail.com wrote:

 Hi Rob, sorry for the late response, festive season here. cassandra
 version is 1.0.8 and thank you, I will read on the READ_STAGE threads.


1.0.8 is pretty seriously old in 2015. I would upgrade to at least 1.2.x
(via 1.1.x) ASAP. Your cluster will be much happier, in general.

=Rob


set selinux context for cassandra to talk to website

2015-03-02 Thread Tim Dunphy
Hey all,

 Ok, I have a website being powered by Cassandra 2.1.3, and I notice that if
SELinux is set to off, the site works beautifully! However, as soon as I set
SELinux to on, I am seeing the following error:

Warning: require_once(/McFrazier/PhpBinaryCql/CqlClient.php): failed to
open stream: Permission denied in
/var/www/jf-ref/includes/classes/class.CQL.php on line 2 Fatal error:
require_once(): Failed opening required
'/McFrazier/PhpBinaryCql/CqlClient.php' (include_path='.:/php/includes') in
/var/www/jf-ref/includes/classes/class.CQL.php on line 2

I'm just wondering how I can get SELinux to allow Cassandra to connect to
the web server?
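A sketch of the usual SELinux checks for this symptom, assuming a RHEL/CentOS targeted policy; the /McFrazier path is taken from the error above:

```shell
# Find the AVC denials behind the PHP "Permission denied" error
ausearch -m avc -ts recent

# /McFrazier sits outside /var/www, so it likely lacks a file context
# that httpd may read; register a label and apply it
semanage fcontext -a -t httpd_sys_content_t "/McFrazier(/.*)?"
restorecon -Rv /McFrazier

# Let Apache/PHP open outbound network connections to Cassandra
# (native protocol 9042 / Thrift 9160)
setsebool -P httpd_can_network_connect 1
```

If `ausearch` shows a different target context, `audit2allow -a` can suggest a narrower policy module instead of the broad boolean.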

Thanks
Tim

-- 
GPG me!!

gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B


Re: does need to disable 'rpc_keepalive' if 'rpc_max_threads' is get larger?

2015-03-02 Thread Robert Coli
On Sun, Mar 1, 2015 at 6:40 PM, pprun pprun.dra...@gmail.com wrote:

 rpc_max_threads is set to 2048 and 'rpc_server_type' is 'hsha'. After
 2 days running, I observed high I/O activity, and the number of
 RPC threads grew to 2048; VisualVM shows most of them as
 'waiting'/'sleeping' (color: yellow).

 I want to know: if I set rpc_keepalive to false (disabling it), will this
 help shrink the idle RPC threads?

 I remember Java 8 comes with newWorkStealingPool, where the number of
 threads may grow and shrink dynamically.


What version of Cassandra, and hsha or sync? How many client threads do you
have?

=Rob


Re: Less frequent flushing with LCS

2015-03-02 Thread Dan Kinder
I see, thanks for the input. Compression is not enabled at the moment, but
I may try increasing that number regardless.

Also I don't think in-memory tables would work since the dataset is
actually quite large. The pattern is more like a given set of rows will
receive many overwriting updates and then not be touched for a while.

On Fri, Feb 27, 2015 at 2:27 PM, Robert Coli rc...@eventbrite.com wrote:

 On Fri, Feb 27, 2015 at 2:01 PM, Dan Kinder dkin...@turnitin.com wrote:

 Theoretically sstable_size_in_mb could be causing it to flush (it's at
 the default 160MB)... though we are flushing well before we hit 160MB. I
 have not tried changing this but we don't necessarily want all the sstables
 to be large anyway,


 I've always wished that the log message told you *why* the SSTable was
 being flushed, which of the various bounds prompted the flush.

 In your case, the size on disk may be under 160MB because compression is
 enabled. I would start by increasing that size.

 Datastax DSE has in-memory tables for this use case.

 =Rob




-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Reboot: Read After Write Inconsistent Even On A One Node Cluster

2015-03-02 Thread Dan Kinder
Hey all,

I had been having the same problem as in those older post:
http://mail-archives.apache.org/mod_mbox/cassandra-user/201411.mbox/%3CCAORswtz+W4Eg2CoYdnEcYYxp9dARWsotaCkyvS5M7+Uo6HT1=a...@mail.gmail.com%3E

To summarize it, on my local box with just one cassandra node I can update
and then select the updated row and get an incorrect response.

My understanding is this may have to do with not having fine-grained enough
timestamp resolution, but regardless I'm wondering: is this actually a bug
or is there any way to mitigate it? It causes sporadic failures in our unit
tests, and having to Sleep() between tests isn't ideal. At least confirming
it's a bug would be nice though.

For those interested, here's a little go program that can reproduce the
issue. When I run it I typically see:
Expected 100 but got: 99
Expected 1000 but got: 999

--- main.go: ---

package main

import (
	"fmt"

	"github.com/gocql/gocql"
)

func main() {
	cf := gocql.NewCluster("localhost")
	db, err := cf.CreateSession()
	if err != nil {
		panic(err.Error())
	}
	// Keyspace ut = update test
	err = db.Query(`CREATE KEYSPACE IF NOT EXISTS ut
		WITH REPLICATION = {'class': 'SimpleStrategy',
		'replication_factor': 1 }`).Exec()
	if err != nil {
		panic(err.Error())
	}
	err = db.Query("CREATE TABLE IF NOT EXISTS ut.test (key text, val text, PRIMARY KEY(key))").Exec()
	if err != nil {
		panic(err.Error())
	}
	err = db.Query("TRUNCATE ut.test").Exec()
	if err != nil {
		panic(err.Error())
	}
	err = db.Query("INSERT INTO ut.test (key) VALUES ('foo')").Exec()
	if err != nil {
		panic(err.Error())
	}

	// Note: the loop bound was garbled in the archive; 10000 is a
	// reconstruction consistent with the reported failures at 100 and 1000.
	for i := 0; i < 10000; i++ {
		val := fmt.Sprintf("%d", i)
		db.Query("UPDATE ut.test SET val = ? WHERE key = 'foo'", val).Exec()

		var result string
		db.Query("SELECT val FROM ut.test WHERE key = 'foo'").Scan(&result)
		if result != val {
			fmt.Printf("Expected %v but got: %v\n", val, result)
		}
	}
}


best practices for time-series data with massive amounts of records

2015-03-02 Thread Clint Kelly
Hi all,

I am designing an application that will capture time series data where we
expect the number of records per user to potentially be extremely high.  I
am not sure if we will eclipse the max row size of 2B elements, but I
assume that we would not want our application to approach that size anyway.

If we wanted to put all of the interactions in a single row, then I would
make a data model that looks like:

CREATE TABLE events (
  id text,
  event_time timestamp,
  event blob,
  PRIMARY KEY (id, event_time))
WITH CLUSTERING ORDER BY (event_time DESC);

The best practice for breaking up large rows of time series data is, as I
understand it, to put part of the time into the partitioning key (
http://planetcassandra.org/getting-started-with-time-series-data-modeling/):

CREATE TABLE events (
  id text,
  date text, // Could also use year+month here or year+week or something
else
  event_time timestamp,
  event blob,
  PRIMARY KEY ((id, date), event_time))
WITH CLUSTERING ORDER BY (event_time DESC);

The downside of this approach is that we can no longer do a simple
continuous scan to get all of the events for a given user.  Some users may
log lots and lots of interactions every day, while others may interact with
our application infrequently, so I'd like a quick way to get the most
recent interaction for a given user.

Has anyone used different approaches for this problem?

The only thing I can think of is to use the second table schema described
above, but switch to an order-preserving hashing function, and then
manually hash the id field.  This is essentially what we would do in
HBase.

Curious if anyone else has any thoughts.

Best regards,
Clint


RE: sstables remain after compaction

2015-03-02 Thread SEAN_R_DURITY
In my experience, you do not want to stay on 1.1 very long. 1.0.8 was very 
stable. 1.1 can get bad in a hurry. 1.2 (with many things moved off-heap) is 
very much better.


Sean Durity – Cassandra Admin, Big Data Team

From: Robert Coli [mailto:rc...@eventbrite.com]
Sent: Monday, March 02, 2015 2:01 PM
To: user@cassandra.apache.org
Subject: Re: sstables remain after compaction

On Sat, Feb 28, 2015 at 5:39 PM, Jason Wee 
peich...@gmail.com wrote:
Hi Rob, sorry for the late response, festive season here. cassandra version is 
1.0.8 and thank you, I will read on the READ_STAGE threads.

1.0.8 is pretty seriously old in 2015. I would upgrade to at least 1.2.x (via 
1.1.x) ASAP. Your cluster will be much happier, in general.

=Rob






Re: What are the factors that affect the release time of each minor version?

2015-03-02 Thread Aleksey Yeschenko
Hi Phil,

Right now there is no explicit scheme for minor releases scheduling.

Eventually we just decide that it’s time for a new release - usually when the 
CHANGES list feels too long - and start the process.

what are the duties to release a version?
Need to build and eventually publish all the artifacts - that process is 
semi-automated. Need to run all the unit/distributed/long/duration tests on the 
tagged sha. Need to go through the whole voting process. All in all about a 
week.

Coincidentally, we’ve been discussing our release policies recently, and minor 
version releases have been discussed as well. It is likely that we’ll switch to 
scheduled minor releases soon (every 2/3/4 weeks or when a major bug gets fixed 
- whatever comes first).

-- 
AY

On February 28, 2015 at 2:49:25 AM, Phil Yang (ud1...@gmail.com) wrote:

Hi all  

As a user of Cassandra, sometimes there are some bugs in my cluster and I  
hope someone can fix them (Of course, if I can fix them myself I'll try to  
contribute my code :) ). For each bug, there is a JIRA ticket to tracking  
it and users can know if the bug is fixed.  

However, there is a lag between this bug being fixed and a new minor  
version being released. Although we can apply the patch of this ticket to  
our online version and build a special snapshot to solve the trouble in our  
clusters or we can use the latest code directly, I think many users still  
want to use an official release with higher reliability and indeed, more  
convenience. In addition, updating more frequently can also reduce the  
trouble caused by unknown bugs. So someone may often ask, "When will the new  
version with this patch be released?"  

In my mind, not only the number of issues resolved in each version but also  
the time interval between two versions is not fixed. So may I know what the  
factors that affect the release time of each minor version?  

Furthermore, apart from a vote on the dev@cassandra mailing list, which I can  
see, what are the duties involved in releasing a version? If it is not heavy  
work, could we make releases more frequent? Or we could make a rule to decide  
when to release a new version, for example: if the latest version was  
released two weeks ago, or we have resolved 20 issues since the latest  
version, we should release a new minor version.  

--  
Thanks,  
Phil Yang  


Re: how to make unique coloumns in cassandra

2015-03-02 Thread Ajaya Agrawal
Please be clear and spend some time writing your question so that other
people know what you are trying to ask. I can't read your mind. :)

Back to your question:
Assuming that you need to search based on the values of the unique column,
invert the index on the auxiliary table: instead of a (phone_number,
user_id) key you would have (user_id, phone_number). Then query the
auxiliary table first, and the user table afterwards if you want other
columns. You can also replicate the other columns into the auxiliary table
to avoid the second query.

Cheers,
Ajaya

On Mon, Mar 2, 2015 at 12:53 PM, Rahul Srivastava 
srivastava.robi...@gmail.com wrote:

 but what if I want to fetch the value using one table? Then this idea might
 fail.

 On Mon, Mar 2, 2015 at 12:46 PM, Ajaya Agrawal ajku@gmail.com wrote:

 Make a table for each of the unique keys. For e.g.

 If the primary key for the user table is user_id and you want the phone
 number column to be unique, create another table whose primary key is
 (phone_number, user_id). Before inserting into the main table, try to insert
 into this table first with an IF NOT EXISTS clause. If it succeeds, go ahead
 with your insert into the user table. Similarly, while deleting a row from
 the primary table, delete the corresponding row in all other tables. The
 order of insertion into the tables matters here; otherwise you would end up
 inducing race conditions.
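The pattern above, sketched in CQL (table and column names are assumed from the example, not from the thread). One caveat: for IF NOT EXISTS to reject a second claim of the same phone number, the unique value must be the entire primary key of the auxiliary table, so user_id is kept as a regular column here:

```sql
CREATE TABLE users_by_phone (
    phone_number text PRIMARY KEY,
    user_id      text
);

-- Step 1: claim the value; the result reports [applied] = false
-- if another user already holds this phone number.
INSERT INTO users_by_phone (phone_number, user_id)
VALUES ('123', 'rahul')
IF NOT EXISTS;

-- Step 2: only if the claim applied, write the main row.
INSERT INTO users (user_id, name, phone_number)
VALUES ('rahul', 'rahul', '123');
```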

 The catch here is that you should never update the unique column. If
 you do, you would have to use locks, and if there are multiple nodes
 running your application you would need a distributed lock. I would
 suggest not updating the unique columns; instead, force your users to
 delete the entry and recreate it. If you can't do that, you need to evaluate
 your choice of database. Perhaps a relational database would be better
 suited to your requirements.

 Hope this helps!

 -Ajaya

 Cheers,
 Ajaya

 On Fri, Feb 27, 2015 at 5:26 PM, ROBIN SRIVASTAVA 
 srivastava.robi...@gmail.com wrote:

 I want to make a unique constraint in Cassandra, i.e. I want all the
 values in a column to be unique in my column family. Ex: name-rahul phone-123
 address-abc

 Now I want that no row with values equal to rahul, 123, or abc gets
 inserted again. Searching on DataStax I found that I can achieve this by
 doing a query on the partition key with IF NOT EXISTS, but I am not getting
 a solution for keeping all three values unique. That means if name-jacob
 phone-123 address-qwe

 is inserted, it should also be rejected, since my phone column already has
 the value 123 in the row with name-rahul.