Re: Cassandra Data Archiving

2012-06-01 Thread Shubham Srivastava
Samal that's pretty smart stuff


From: samal [mailto:samalgo...@gmail.com]
Sent: Friday, June 01, 2012 11:24 AM
To: user@cassandra.apache.org
Subject: Re: Cassandra Data Archiving

I believe you are talking about HDD space consumed by user-generated data 
which is no longer required after 15 days, or may be required again later.
The first option is to use TTL, which you don't want to use. The second, as 
Aaron pointed out, is snapshotting the data; but the data still exists in the 
cluster, and the snapshot is only used for backup.
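
For reference, a TTL write in pycassa is a single parameter on insert. A minimal 
sketch, assuming a hypothetical 'trip_offer' column family and keyspace:

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('Keyspace1', ['localhost:9160'])  # hypothetical keyspace/host
cf = ColumnFamily(pool, 'trip_offer')  # hypothetical column family

# The columns written here expire automatically 15 days after the write.
cf.insert('row_key', {'col': 'value'}, ttl=86400 * 15)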

I'm thinking of using column family buckets: 15 days per bucket, 2 buckets a month.

Create a new CF every 15th day with a timestamp marker, trip_offer_cf_[ts 
- ts%(86400*15)], and cache the CF name in the app for 15 days. After the 
15th day the old CF bucket becomes read-only, no writes go into it; snapshot 
that old_cf_bucket data and delete the CF a few days later. This keeps the CF 
count fixed.

current cf count = n,
bucket cf count = b*n

Use a separate cluster for analytics on the old data.
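
A minimal sketch of the bucket-name computation described above, assuming epoch 
seconds and the hypothetical trip_offer_cf_ prefix:

import time

BUCKET_SECONDS = 86400 * 15  # one bucket per 15 days

def bucket_cf_name(ts=None):
    # Round down to the start of the current 15-day window, matching
    # trip_offer_cf_[ts - ts%(86400*15)] from the description above.
    ts = int(time.time()) if ts is None else int(ts)
    return 'trip_offer_cf_%d' % (ts - ts % BUCKET_SECONDS)

print(bucket_cf_name())            # current bucket
print(bucket_cf_name(1338508800))  # same name for any ts in the same window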

/Samal

On Fri, Jun 1, 2012 at 9:58 AM, Harshvardhan Ojha 
harshvardhan.o...@makemytrip.com 
wrote:
Problem statement:
We are keeping daily generated data (user-generated content) in Cassandra, but 
our application uses only the last 15 days of data. So how can we archive data 
older than 15 days so that we can reduce the load on the Cassandra ring?

Note: we can't apply TTL, as this data may be needed in the future.


From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Friday, June 01, 2012 6:57 AM
To: user@cassandra.apache.org
Subject: Re: Cassandra Data Archiving
Subject: Re: Cassandra Data Archiving

I'm not sure of your needs, but the simplest thing to consider is snapshotting 
and copying off the node.

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 1/06/2012, at 12:23 AM, Shubham Srivastava wrote:


I need to archive my Cassandra data into another permanent storage.

Two intents:

1. To shed the unused data from the live data.

2. To use the archived data for analytics, or as a potential source for a 
data warehouse.

Any recommendations in terms of strategies or tools to use?

Regards,
Shubham Srivastava | Technical Lead - Technology Development

+91 124 4910 548  |  MakeMyTrip.com (http://MakeMyTrip.com), 243 SP Infocity, 
Udyog Vihar Phase 1, Gurgaon, Haryana - 122 016, India









Re: About Composite range queries

2012-06-01 Thread Cyril Auburtin
OK, sorry, I thought columns inside a row had their keys hashed as well.
So they are just stored as raw bytes.

thx

2012/6/1 aaron morton aa...@thelastpickle.com

 If you hash 4 composite keys, let's say
 ('A','B','C'), ('A','D','C'), ('A','E','X'), ('A','R','X'), you have only 4
 hashes or you have more?

 Four

 If it's 4, how come you are able to range query for example between
 start_column=('A', 'D') and end_column=('A','E') and get this column
 ('A','D','C')

 That's a slice query against columns; column names are not hashed. Column
 names are sorted according to the comparator, which can differ from the raw
 byte order.

 A range query is against rows. Row keys are hashed (using the Random
 Partitioner) to create tokens, and rows are stored in token order.
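
 As a concrete illustration, a minimal pycassa sketch, assuming a hypothetical
 CF whose comparator is a three-part CompositeType:

 from pycassa.pool import ConnectionPool
 from pycassa.columnfamily import ColumnFamily

 pool = ConnectionPool('Keyspace1', ['localhost:9160'])  # hypothetical keyspace/host
 cf = ColumnFamily(pool, 'Composites')  # hypothetical CF with a CompositeType comparator

 # Column names are compared part-by-part by the composite comparator, so the
 # slice from ('A', 'D') to ('A', 'E') matches the column named ('A', 'D', 'C').
 result = cf.get('key1', column_start=('A', 'D'), column_finish=('A', 'E'))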

 the composites are like chapters between the whole keys set, there must be
 intermediate keys added?

 Not sure what you mean.

 Cheers

   -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 1/06/2012, at 12:52 AM, Cyril Auburtin wrote:

 but sorry, I don't understand

 If you hash 4 composite keys, let's say
 ('A','B','C'), ('A','D','C'), ('A','E','X'), ('A','R','X'), you have only 4
 hashes or you have more?

 If it's 4, how come you are able to range query for example between
 start_column=('A', 'D') and end_column=('A','E') and get this column
 ('A','D','C')

 the composites are like chapters between the whole keys set, there must be
 intermediate keys added?


 2012/5/31 aaron morton aa...@thelastpickle.com

 It is hashed once.

 To the partitioner it's just some bytes. Other parts of the code care
 about its structure.

 Cheers

   -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 31/05/2012, at 7:00 PM, Cyril Auburtin wrote:

 Thx for the answer.
 One more thing: a composite key is not hashed only once, I guess?
 Is it hashed once for each part of the composite?
 So this would mean there are two or three or more times as many keys as for
 normal column keys; is that true?
 On 31 May 2012 at 02:59, aaron morton aa...@thelastpickle.com wrote:

 Composite Columns compare each part in turn, so the values are ordered
 as you've shown them.

 However the rows are not ordered according to key value. They are
 ordered using the random token generated by the partitioner see
 http://wiki.apache.org/cassandra/FAQ#range_rp

 What is the real advantage compared to super column families?

 They are faster.

 Cheers

   -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 29/05/2012, at 10:08 PM, Cyril Auburtin wrote:

 How is it done in Cassandra to be able to range query on a composite key?

 key1 = (A:A:C), (A:B:C), (A:C:C), (A:D:C), (B:A:C)

 like get_range (key1, start_column=(A,), end_column=(A, C)); will
 return [ (A:B:C), (A:C:C) ] (in pycassa)

 I mean does the composite implementation add much overhead to make it
 work?
 Does it need to add other Column families, to be able to range query
 between composites simple keys (first, second and third part of the
 composite)?

 What is the real advantage compared to super column families?

 key1 = A: (A,C), (B,C), (C,C), (D,C)  , B: (A,C)

 thx








row_cache_provider = 'SerializingCacheProvider'

2012-06-01 Thread ruslan usifov
Hello

I began using SerializingCacheProvider for row caching, and got extreme
Java heap growth. But I thought this cache provider doesn't use the Java
heap?


Re: How can we use composite indexes and secondary indexes together

2012-06-01 Thread Vivek Mishra
Have a look at Kundera (https://github.com/impetus-opensource/Kundera). It
does provide some sort of support (using Lucene) and allow you to deal with
association in JPA way.

-Vivek

On Fri, Jun 1, 2012 at 6:54 AM, aaron morton aa...@thelastpickle.comwrote:

 If you want to do arbitrary complex online / realtime queries look at Data
 Stax Enterprise, or https://github.com/tjake/Solandra or straight Solr.

 Alternatively, denormalise the model to materialise the results when you
 insert, so your query is a straight lookup. Or do some client-side filtering
 / aggregation.
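
 A minimal sketch of that denormalisation pattern in pycassa, with hypothetical
 CF names and key layout:

 from pycassa.pool import ConnectionPool
 from pycassa.columnfamily import ColumnFamily

 pool = ConnectionPool('Keyspace1', ['localhost:9160'])  # hypothetical
 users = ColumnFamily(pool, 'Users')                       # canonical rows
 users_by_name_age = ColumnFamily(pool, 'UsersByNameAge')  # materialised lookup

 def insert_user(user_id, row):
     # Write the canonical row, then a lookup row keyed by the mandatory
     # fields, so the query by firstname/lastname/age is a single key lookup.
     users.insert(user_id, row)
     lookup_key = '%s:%s:%s' % (row['Firstname'], row['Lastname'], row['Age'])
     users_by_name_age.insert(lookup_key, {user_id: ''})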

 If you want to do the queries offline, you can use Pig or Hive with Hadoop
 over Cassandra. The Apache Cassandra distro includes Pig support; Hive
 support is coming (I think), and there are Hadoop interfaces. You can also look at
 Data Stax Enterprise.


 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 31/05/2012, at 11:07 PM, Nury Redjepow wrote:

 We want to use Cassandra to store complex data, but we can't figure out
 how to organize the indexes.

 Our table (column family) looks like this:

 Users = { RandomId int, Firstname varchar, Lastname varchar, Age int,
 Country int, ChildCount int }

 In our queries we have mandatory fields (Firstname,Lastname,Age) and extra
 search options (Country,ChildCount). How do we organize index to make this
 kind of queries fast?

 First I thought it would be natural to make a composite index on
 (Firstname,Lastname,Age) and add separate secondary indexes on the remaining
 fields (Country and ChildCount). But I can't insert rows into the table after
 creating the secondary indexes. And I also can't query the table.

 I'm using cassandra 1.1.0, and cqlsh with --cql3 option.

 Any other suggestions to solve our problem (complex queries with mandatory
 and additional options) are welcome.
 The main point is: how can we join data in Cassandra? If I make a few index
 column families, do I need to intersect the values to get rows that pass all
 the search criteria? Or should I use something based on Hadoop (Pig, Hive) to
 make such queries?

 Respectfully, Nury






TimedOutException()

2012-06-01 Thread Oleg Dulin
We are using Cassandra 1.1.0 with an older Pelops version, but I don't 
think that in itself is a problem here.


I am getting this exception:

TimedOutException()
   at org.apache.cassandra.thrift.Cassandra$get_slice_result.read(Cassandra.java:7660)
   at org.apache.cassandra.thrift.Cassandra$Client.recv_get_slice(Cassandra.java:570)
   at org.apache.cassandra.thrift.Cassandra$Client.get_slice(Cassandra.java:542)
   at org.scale7.cassandra.pelops.Selector$3.execute(Selector.java:683)
   at org.scale7.cassandra.pelops.Selector$3.execute(Selector.java:680)
   at org.scale7.cassandra.pelops.Operand.tryOperation(Operand.java:82)


Is my understanding correct that this is where Cassandra is telling us 
it can't accomplish something within that timeout value -- as opposed 
to a network timeout? Where is it set?


Thanks,
Oleg




Re: 1.1 not removing commit log files?

2012-06-01 Thread Rob Coli
On Thu, May 31, 2012 at 7:01 PM, aaron morton aa...@thelastpickle.com wrote:
 But that talks about segments not being cleared at startup. Does not explain
 why they were allowed to get past the limit in the first place.

Perhaps the commit log size tracking for this limit does not, for some
reason, track hints? This seems like the obvious answer given the state
which appears to trigger it. It doesn't explain why the files aren't
getting deleted after the hints are delivered, of course...

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Invalid Counter Shard errors?

2012-06-01 Thread Charles
Ok, will do. Thanks for the reply.

C



Secondary Indexes, Quorum and Cluster Availability

2012-06-01 Thread Jim Ancona
Hi,

We have an application with two code paths, one of which uses a secondary
index query and the other, which doesn't. While testing node down scenarios
in our cluster we got a result which surprised (and concerned) me, and I
wanted to find out if the behavior we observed is expected.

Background:

   - 6 nodes in the cluster (in order: A, B, C, E, F and G)
   - RF = 3
   - All operations at QUORUM
   - Operation 1: Read by row key followed by write
   - Operation 2: Read by secondary index, followed by write

While running a mixed workload of operations 1 and 2, we got the following
results:

 Scenario             Result
 All nodes up         All operations succeed
 One node down        All operations succeed
 Nodes A and E down   All operations succeed
 Nodes A and B down   Operation 1: ~33% fail; Operation 2: all fail
 Nodes A and C down   Operation 1: ~17% fail; Operation 2: all fail
We had expected (perhaps incorrectly) that the secondary index reads would
fail in proportion to the portion of the ring that was unable to reach
quorum, just as the row key reads did. For both operation types the
underlying failure was an UnavailableException.

The same pattern repeated for the other scenarios we tried. The row key
operations failed at the expected ratios, given the portion of the ring
that was unable to meet quorum because of nodes down, while all the
secondary index reads failed as soon as 2 out of any 3 adjacent nodes were
down.

Is this an expected behavior? Is it documented anywhere? I didn't find it
with a quick search.

The operation doing secondary index query is an important one for our app,
and we'd really prefer that it degrade gracefully in the face of cluster
failures. My plan at this point is to do that query at ConsistencyLevel.ONE
(and accept the increased risk of inconsistency). Will that work?
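
A minimal pycassa sketch of that plan, with hypothetical keyspace, CF, and
indexed column names; only the secondary index query drops to CL.ONE while
everything else stays at QUORUM:

from pycassa import ConsistencyLevel
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily
from pycassa.index import create_index_clause, create_index_expression

pool = ConnectionPool('Keyspace1', ['localhost:9160'])  # hypothetical
cf = ColumnFamily(pool, 'MyCF')  # hypothetical CF with a secondary index on 'status'

clause = create_index_clause(
    [create_index_expression('status', 'active')],  # hypothetical column/value
    count=100)
# Read at CL.ONE so the query degrades gracefully when 2 of 3 replicas are down.
rows = cf.get_indexed_slices(clause, read_consistency_level=ConsistencyLevel.ONE)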

Thanks in advance,

Jim


Re: TimedOutException()

2012-06-01 Thread Tyler Hobbs
On Fri, Jun 1, 2012 at 9:39 AM, Oleg Dulin oleg.du...@gmail.com wrote:


 Is my understanding correct that this is where cassandra is telling us it
 can't accomplish something within that timeout value -- as opposed to a
 network timeout? Where is it set?


That's correct.  Basically, the coordinator sees that a replica has not
responded (or can not respond) before hitting a timeout.  This is
controlled by rpc_timeout_in_ms in cassandra.yaml.
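
For reference, the relevant cassandra.yaml setting (10000 ms is the 1.1 default):

rpc_timeout_in_ms: 10000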

-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: tokens and RF for multiple phases of deployment

2012-06-01 Thread Chong Zhang
I followed the doc to add the new node. After the nodetool repair, the
'Load' on the new node in DC2 increased to 250M. But the 'Owns' column
still shows 50%, 50%, 0%, and I guess that's OK because the new token
value is 1?
Thanks,
Chong
On Thu, May 31, 2012 at 9:52 PM, aaron morton aa...@thelastpickle.comwrote:

 The ring (2 in DC1, 1 in DC2) looks OK, but the load on the new node in
 DC2 is almost 0%.

 yeah, thats the way it will look.

 But all the other rows are not in the new node. Do I need to copy the data
 files from a node in DC1 to the new node?

 How did you add the node ? (see
 http://www.datastax.com/docs/1.0/operations/cluster_management#adding-nodes-to-a-cluster
 )

 if in doubt run nodetool repair on the new node.

 Cheers


 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 1/06/2012, at 3:46 AM, Chong Zhang wrote:

 Thanks Aaron.

 I might use LOCAL_QUORUM to avoid waiting on the ack from DC2.

 Another question: after I set up a new node with token +1 in a new DC, and
 updated a CF with RF {DC1:2, DC2:1}, when I update a column on one node in
 DC1 it's also updated on the new node in DC2. But all the other rows are
 not on the new node. Do I need to copy the data files from a node in DC1 to
 the new node?

 The ring (2 in DC1, 1 in DC2) looks OK, but the load on the new node in
 DC2 is almost 0%.

 Address     DC   Rack  Status  State   Load       Owns    Token
                                                           85070591730234615865843651857942052864
 10.10.10.1  DC1  RAC1  Up      Normal  313.99 MB  50.00%  0
 10.10.10.3  DC2  RAC1  Up      Normal  7.07 MB    0.00%   1
 10.10.10.2  DC1  RAC1  Up      Normal  288.91 MB  50.00%  85070591730234615865843651857942052864

 Thanks,
 Chong

 On Thu, May 31, 2012 at 5:48 AM, aaron morton aa...@thelastpickle.comwrote:


 Could you provide some guidance on how to assign the tokens in these
 growing deployment phases?


 background
 http://www.datastax.com/docs/1.0/install/cluster_init#calculating-tokens-for-a-multi-data-center-cluster

 Start with tokens for a 4 node cluster. Add the next 4 between
 each of the ranges. Add 8 in the new DC to have the same tokens as the
 first DC +1
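
 A minimal sketch of that token arithmetic, assuming the RandomPartitioner's
 2**127 token space and a +1 offset for the second DC:

 # Evenly spaced RandomPartitioner tokens over the 2**127 token space.
 def tokens(node_count, offset=0):
     return [i * (2 ** 127 // node_count) + offset for i in range(node_count)]

 dc1_4 = tokens(4)            # initial 4-node DC1
 dc1_8 = tokens(8)            # growing to 8 keeps the original 4 tokens and adds midpoints
 dc2_8 = tokens(8, offset=1)  # DC2: the same tokens as DC1, each +1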

 Also if we use the same RF (3) in both DC, and use EACH_QUORUM for write
 and LOCAL_QUORUM for read, can the read also reach to the 2nd cluster?

 No. It will fail if there are not enough nodes available in the first DC.

 We'd like to keep both write and read on the same cluster.

 Writes go to all replicas. Using EACH_QUORUM means the client in the
 first DC will be waiting for the quorum from the second DC to ack the
 write.


 Cheers
   -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 31/05/2012, at 3:20 AM, Chong Zhang wrote:

 Hi all,

 We are planning to deploy a small cluster with 4 nodes in one DC first,
 and will expand that to 8 nodes, then add another DC with 8 nodes for fail
 over (not active-active), so all the traffic will go to the 1st cluster,
 and switch to the 2nd cluster if the whole 1st cluster is down or
 on maintenance.

 Could you provide some guidance on how to assign the tokens in these growing
 deployment phases? I looked at some docs but it's not very clear how to
 assign tokens for the fail-over case.
 Also if we use the same RF (3) in both DC, and use EACH_QUORUM for write
 and LOCAL_QUORUM for read, can the read also reach to the 2nd cluster?
 We'd like to keep both write and read on the same cluster.

 Thanks in advance,
 Chong







nodes moving spontaneously

2012-06-01 Thread Curt Allred
We have a 10 node cluster (v0.7.9) split into 2 datacenters.  Three times we 
have seen nodes move themselves to different locations in the ring.  In each 
case, the move unbalanced the ring. In one case a node moved to the opposite 
side of the ring.

Sometime after the first spontaneous move we started using Datastax OpsCenter.  
The next 2 moves showed up in its event log like:
5/20/2012 11:23am - Info -  Host 12.34.56.78 moved from '12345' to '54321'

where '12345' and '54321' are the old and new tokens.

Anyone know what's causing this?



Re: nodes moving spontaneously

2012-06-01 Thread Tyler Hobbs
OpsCenter just periodically calls describe_ring() on different nodes in the
cluster, so that's how it's getting that information.

Maybe try running nodetool ring on each node in your cluster to make sure
they all have the same view of the ring?

On Fri, Jun 1, 2012 at 4:01 PM, Curt Allred c...@mediosystems.com wrote:

 We have a 10 node cluster (v0.7.9) split into 2 datacenters.  Three times
 we have seen nodes move themselves to different locations in the ring.  In
 each case, the move unbalanced the ring. In one case a node moved to the
 opposite side of the ring.

 Sometime after the first spontaneous move we started using Datastax
 OpsCenter.  The next 2 moves showed up in its event log like:

 5/20/2012 11:23am - Info -  Host 12.34.56.78 moved from '12345' to '54321'

 where '12345' and '54321' are the old and new tokens.

 Anyone know what's causing this?




-- 
Tyler Hobbs
DataStax http://datastax.com/


Connecting Javaee server to Cassandra

2012-06-01 Thread xsdt
I have an existing Java EE application running on JBoss 7 and using PostgreSQL, 
which I now want to replace with Cassandra 1.1. Hours of Internet searching and I 
can't find any useful information, example, or hint on how to connect a Java EE 
server, i.e. JBoss, to Cassandra (or might it even be necessary at all?). The whole 
thing makes me feel like I am doing something unique (cough).
 I am using JBoss, Cassandra, and either Hector or Astyanax. Any relevant 
suggestions would be most welcome.
 Thanks


Re: TimedOutException()

2012-06-01 Thread Oleg Dulin
Tyler Hobbs ty...@datastax.com wrote:
 On Fri, Jun 1, 2012 at 9:39 AM, Oleg Dulin oleg.du...@gmail.com wrote:
 
  Is my understanding correct that this is where cassandra is telling us it
  can't accomplish something within that timeout value -- as opposed to a
  network timeout? Where is it set?
 
 That's correct.  Basically, the coordinator sees that a replica has not
 responded (or can not respond) before hitting a timeout.  This is
 controlled by rpc_timeout_in_ms in cassandra.yaml.
 
 --
 Tyler Hobbs
 DataStax http://datastax.com/

So if we are using the random partitioner, and a read consistency of ONE, what
does that mean?

We have a 3 node cluster, using write/read consistency of ONE and a replication
factor of 3.

Does the node we are connecting to try to proxy requests? Wouldn't our
configuration ensure all nodes have replicas?