Re: [MASSMAIL]Re: Any use-case about a migration from SQL Server to Cassandra?

2015-06-24 Thread Paulo Ricardo Motta Gomes
https://labs.spotify.com/2015/06/23/user-database-switch/

On Wed, Jun 24, 2015 at 5:57 PM, Marcos Ortiz mlor...@uci.cu wrote:

  Where is the link, Carlos?


 On 24/06/15 07:18, Carlos Alonso wrote:

 This article from Spotify Labs is a really nice write-up of migrating SQL
 (Postgres in this case) to Cassandra

  Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso

 On 23 June 2015 at 20:23, Alex Popescu al...@datastax.com wrote:


 On Tue, Jun 23, 2015 at 12:13 PM, Marcos Ortiz mlor...@uci.cu wrote:

 2- They use C# heavily in a Microsoft-based environment, so I need to
 know if the .NET driver is ready for production use


 The DataStax C# driver has been used in production for quite a while by
 numerous users. It is the most up-to-date, feature-rich, and
 tunable C# driver for Apache Cassandra and DataStax Enterprise.

  Anyways, if there's anything missing we are always happy to improve it.

  (as you can see from my sig, I do work for DataStax, but the above is
 very true)


  --
Bests,

 Alex Popescu | @al3xandru
 Sen. Product Manager @ DataStax



 --
 Marcos Ortiz http://about.me/marcosortiz, Sr. Product Manager (Data
 Infrastructure) at UCI
 @marcosluis2186 http://twitter.com/marcosluis2186





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: connections remain on CLOSE_WAIT state after process is killed after upgrade to 2.0.15

2015-06-22 Thread Paulo Ricardo Motta Gomes
For the record: https://issues.apache.org/jira/browse/CASSANDRA-9630

On Mon, Jun 15, 2015 at 7:19 PM, Paulo Ricardo Motta Gomes 
paulo.mo...@chaordicsystems.com wrote:

 Just a quick update, I was able to fix the problem by reverting the patch
 CASSANDRA-8336 in our custom cassandra build. I don't know the root cause
 yet though. I will open a JIRA ticket and post here for reference later.

 On Fri, Jun 12, 2015 at 11:31 AM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 Hello,

 We recently upgraded a cluster from 2.0.12 to 2.0.15 and now whenever we
 stop/kill a cassandra process, some other nodes keep a connection with the
 dead node in the CLOSE_WAIT state on port 7000 for about 5-20 minutes.

 So, if I start the killed node again, it cannot handshake with the nodes
 which have a connection in the CLOSE_WAIT state until that connection is
 closed, so they remain in the down state to each other for 5-20 minutes,
 until they can handshake again.

 I believe this is somehow related to the fixes CASSANDRA-8336 and
 CASSANDRA-9238, and could also be a duplicate of CASSANDRA-8072. I will
 continue to investigate to see if I find more evidence, but any help at
 this point would be appreciated, or at least a confirmation that it could
 be related to any of these tickets.

 Cheers,

 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200




 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


connections remain on CLOSE_WAIT state after process is killed after upgrade to 2.0.15

2015-06-12 Thread Paulo Ricardo Motta Gomes
Hello,

We recently upgraded a cluster from 2.0.12 to 2.0.15 and now whenever we
stop/kill a cassandra process, some other nodes keep a connection with the
dead node in the CLOSE_WAIT state on port 7000 for about 5-20 minutes.

So, if I start the killed node again, it cannot handshake with the nodes
which have a connection in the CLOSE_WAIT state until that connection is
closed, so they remain in the down state to each other for 5-20 minutes,
until they can handshake again.
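
For reference, this is how we spot the lingering connections on a peer (a
minimal sketch, assuming a standard Linux netstat and the default
storage_port of 7000):

netstat -tan | grep ':7000' | grep CLOSE_WAIT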

I believe this is somehow related to the fixes CASSANDRA-8336 and
CASSANDRA-9238, and could also be a duplicate of CASSANDRA-8072. I will
continue to investigate to see if I find more evidence, but any help at
this point would be appreciated, or at least a confirmation that it could
be related to any of these tickets.

Cheers,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Why select returns tombstoned results?

2015-03-31 Thread Paulo Ricardo Motta Gomes
 What version of Cassandra are you running? Are you by any chance running
repairs on your data?

On Mon, Mar 30, 2015 at 5:39 PM, Benyi Wang bewang.t...@gmail.com wrote:

 Thanks for replying.

 In cqlsh, if I change to quorum (CONSISTENCY QUORUM), sometimes the select
 returns the deleted row, sometimes not.

 I have two virtual data centers: service (3 nodes) and analytics (4 nodes
 collocated with Hadoop data nodes). The table has 3 replicas in service and
 2 in analytics. When I write, I write into analytics using local_one. So I
 guess the data may not have replicated to all nodes yet.

 I will try to use strong consistency for write.
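
 (For reference on the arithmetic: with 3 replicas in service and 2 in
 analytics, N = 5 and QUORUM = 3, so QUORUM writes plus QUORUM reads overlap
 on at least one replica (3 + 3 > 5), while local_one writes give no such
 guarantee. In cqlsh the level is set per session, e.g. CONSISTENCY QUORUM;)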



 On Mon, Mar 30, 2015 at 11:59 AM, Prem Yadav ipremya...@gmail.com wrote:

 Increase the read CL to quorum and you should get correct results.
 How many nodes do you have in the cluster and what is the replication
 factor for the keyspace?

 On Mon, Mar 30, 2015 at 7:41 PM, Benyi Wang bewang.t...@gmail.com
 wrote:

 Create table tomb_test (
    guid text,
    content text,
    range text,
    rank int,
    id text,
    cnt int,
    primary key (guid, content, range, rank)
 )

 Sometimes I delete rows using the Cassandra Java driver with this query

 DELETE FROM tomb_test WHERE guid=? and content=? and range=?

 in an UNLOGGED batch statement. The consistency level is local_one.

 But if I run

 SELECT * FROM tomb_test WHERE guid='guid-1' and content='content-1' and
 range='week'
 or
 SELECT * FROM tomb_test WHERE guid='guid-1' and content='content-1' and
 range='week' and rank = 1

 The result shows the deleted rows.

 If I run this select, the deleted rows are not shown

 SELECT * FROM tomb_test WHERE guid='guid-1' and content='content-1'

 If I run the delete statement in cqlsh, the deleted rows won't show up.

 How can I fix this?






-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Should a node that is bootstrapping be receiving writes in addition to the streams it is receiving?

2015-03-02 Thread Paulo Ricardo Motta Gomes
I'm also facing a similar issue while bootstrapping a replacement node via
the -Dreplace_address flag. The node is streaming data from neighbors, but
cfstats shows 0 counts for all metrics of all CFs on the bootstrapping node:

SSTable count: 0
SSTables in each level: [0, 0, 0, 0, 0, 0, 0, 0, 0]
Space used (live), bytes: 0
Space used (total), bytes: 0
SSTable Compression Ratio: 0.0
Number of keys (estimate): 0
Memtable cell count: 0
Memtable data size, bytes: 0
Memtable switch count: 0
Local read count: 0
Local read latency: 0.000 ms
Local write count: 0
Local write latency: 0.000 ms
Pending tasks: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.0
Bloom filter space used, bytes: 0
Compacted partition minimum bytes: 0
Compacted partition maximum bytes: 0
Compacted partition mean bytes: 0
Average live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0

I also checked via JMX and all the write counts are zero. Is the node
supposed to receive writes during bootstrap?

The other funny thing during bootstrap is that nodetool status shows the
bootstrapping node as Up/Normal (UN) instead of Up/Joining (UJ). Is this
expected or is it a bug? The bootstrapping node does not even appear in the
nodetool status of other nodes.

UN  X.Y.Z.244  15.9 GB  1  3.7%  52fb21e-4621-4533-b201-8c1a7adbe818  rack

If I do a nodetool netstats, I see:

Mode: JOINING
Bootstrap 647d4b30-c11e-11e4-9249-173e73521fb44

Cheers,

Paulo

On Thu, Oct 16, 2014 at 3:53 PM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Oct 15, 2014 at 10:07 PM, Peter Haggerty 
 peter.hagge...@librato.com wrote:

 The node wrote gigs of data to various CFs during the bootstrap so it
 was clearly writing in some sense and it has the expected behavior
 after the bootstrap. Is cfstats correct when it reports that there
 were no writes during a bootstrap?


 As I understand it :

 Writes (extra writes, from the perspective of replication factor, f/e a
 RF=3 cluster has effective RF=4 during bootstrap, but not relevant for
 consistency purposes until end of bootstrap) occur via the storage protocol
 during bootstrap, so I would expect to see those reflected in cfstats.

 I'm relatively confident it is in fact receiving those writes, so your
 confusion might just be a result of how it's reported?

 =Rob
 http://twitter.com/rcolidba




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: best supported spark connector for Cassandra

2015-02-13 Thread Paulo Ricardo Motta Gomes
I used to use Calliope, which was really awesome before DataStax's native
integration with Spark. Now I'm quite happy with the official DataStax
Spark connector; it's very straightforward to use.

I never tried to use these drivers with Java though; I'd suggest using them
with Scala, which is the best option for writing Spark jobs.

On Fri, Feb 13, 2015 at 12:12 PM, Carlos Rolo r...@pythian.com wrote:

 Not for sure ;)

 If you need Cassandra support I can forward you to someone to talk to at
 Pythian.

 Regards,

 Regards,

 Carlos Juzarte Rolo
 Cassandra Consultant

 Pythian - Love your data

 rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
 http://linkedin.com/in/carlosjuzarterolo*
 Tel: 1649
 www.pythian.com

 On Fri, Feb 13, 2015 at 3:05 PM, Marcelo Valle (BLOOMBERG/ LONDON) 
 mvallemil...@bloomberg.net wrote:

 Actually, I am not the one looking for support, but thank you a lot
 anyway.
 But from your message I guess the answer is yes: DataStax is not the only
 Cassandra vendor offering support and changing the official Cassandra
 source at this moment, is that right?

 From: user@cassandra.apache.org
 Subject: Re: best supported spark connector for Cassandra

 Of course, Stratio Deep and Stratio Cassandra are licensed Apache 2.0.

 Regarding the Cassandra support, I can introduce you to someone in
 Stratio that can help you.

 2015-02-12 15:05 GMT+01:00 Marcelo Valle (BLOOMBERG/ LONDON) 
 mvallemil...@bloomberg.net:

 Thanks for the hint Gaspar.
 Do you know if Stratio Deep / Stratio Cassandra are also licensed Apache
 2.0?

 I had interest in knowing more about Stratio when I was working at a
 startup. Now, at a blue chip, it seems one of the hardest obstacles to
 using Cassandra in a project is the need for a team supporting it, and it
 seems people are especially concerned about how many vendors an open source
 solution has available to provide support.

 This seems to be kind of an advantage of HBase, as there are many
 vendors supporting it, but I wonder if Stratio can be considered an
 alternative to DataStax regarding Cassandra support?

 It's not my call here to decide anything, but as part of the community
 it helps to have this business scenario clear. I could say Cassandra would
 be the best-fit technical solution for some projects, but sometimes
 non-technical factors are in the game, like this need for having more than
 one vendor available...


 From: gmu...@stratio.com
 Subject: Re: best supported spark connector for Cassandra

 My suggestion is to use Java or Scala instead of Python. For Java/Scala
 both the DataStax and Stratio drivers are valid and similar options. As far
 as I know they both take care of data locality and are not based on the
 Hadoop interface. The advantage of Stratio Deep is that it allows you to
 integrate Spark not only with Cassandra but with MongoDB, Elasticsearch,
 Aerospike and others as well.
 Stratio has forked Cassandra to include some additional features
 such as Lucene-based secondary indexes. So the Stratio driver works fine
 with Apache Cassandra and also with their fork.

 You can find some examples of using Deep here:
 https://github.com/Stratio/deep-examples. Please, if you need some help
 with Stratio Deep, do not hesitate to contact us.


 2015-02-11 17:18 GMT+01:00 shahab shahab.mok...@gmail.com:

 I am using the Calliope cassandra-spark connector (
 http://tuplejump.github.io/calliope/), which is quite handy and easy
 to use!
 The only problem is that it is a bit outdated; it works with Spark 1.1.0.
 Hopefully a new version comes soon.

 best,
 /Shahab

 On Wed, Feb 11, 2015 at 2:51 PM, Marcelo Valle (BLOOMBERG/ LONDON) 
 mvallemil...@bloomberg.net wrote:

 I just finished a scala course, nice exercise to check what I learned
 :D

 Thanks for the answer!

 From: user@cassandra.apache.org
 Subject: Re: best supported spark connector for Cassandra

 Start looking at the Spark/Cassandra connector here (in Scala):
 https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector

 Data locality is provided by this method:
 https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336

 Start digging from this all the way down the code.

 As for Stratio Deep, I can't tell how they did the integration with
 Spark. Take some time to dig into their code to understand the logic.



 On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON) 
 mvallemil...@bloomberg.net wrote:

 Taking the opportunity that Spark was being discussed in another thread, I
 decided to start a new one, as I have interest in using Spark + Cassandra
 in the future.

 About 3 years ago, Spark was not an existing option and we tried to
 use Hadoop to process Cassandra data. My experience was horrible and we
 reached the conclusion it was faster to develop an internal tool than
 insist on 

Re: Database schema migration

2015-01-29 Thread Paulo Ricardo Motta Gomes
Hello José,

There isn't yet an officially supported way to perform schema migrations
afaik, but there are quite a few tools on GitHub that perform migrations
either from within the application or as external tools. We currently use
this tool to perform migrations embedded in the application:
https://github.com/fromanator/mutagen-cassandra

You may find other options in the mailing list archives.

Cheers,

On Thu, Jan 29, 2015 at 8:31 AM, José Guilherme Vanz 
guilherme@gmail.com wrote:

 Hello

 I have been studying Cassandra for a while, and to practice the libraries
 and concepts I will implement a simple Cassandra client. During my research
 I ran into a doubt about schema migrations. What is the common/best
 practice in production clusters? I mean, who actually performs the schema
 migration? Does the application or the cluster manager have to update the
 schema before updating the application?

 All the best
 Vanz




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Database schema migration

2015-01-29 Thread Paulo Ricardo Motta Gomes
This might be of interest (you probably have already found it):
http://grokbase.com/t/cassandra/user/14bs9zvasf/cassandra-schema-migrator

On Thu, Jan 29, 2015 at 9:16 AM, José Guilherme Vanz 
guilherme@gmail.com wrote:

 Hi, Ricardo

 Thank you for your quick reply. =]
 I'll take a look at mutagen-cassandra and others I find in the archives

 All the best


 On Thu, Jan 29, 2015 at 8:38 AM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 Hello José,

 There isn't yet an officially supported way to perform schema migrations
 afaik, but there are quite a few tools on GitHub that perform migrations
 either from within the application or as external tools. We currently use
 this tool to perform migrations embedded in the application:
 https://github.com/fromanator/mutagen-cassandra

 You may find other options in the mailing list archives.

 Cheers,

 On Thu, Jan 29, 2015 at 8:31 AM, José Guilherme Vanz 
 guilherme@gmail.com wrote:

 Hello

 I have been studying Cassandra for a while, and to practice the libraries
 and concepts I will implement a simple Cassandra client. During my research
 I ran into a doubt about schema migrations. What is the common/best
 practice in production clusters? I mean, who actually performs the schema
 migration? Does the application or the cluster manager have to update the
 schema before updating the application?

 All the best
 Vanz




 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200




 --
 Att. José Guilherme Vanz
 br.linkedin.com/pub/josé-guilherme-vanz/51/b27/58b/
 http://br.linkedin.com/pub/jos%C3%A9-guilherme-vanz/51/b27/58b/
 "Suffering is temporary; giving up is forever" - Bernardo Fonseca,
 record holder of the Antarctic Ice Marathon.




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: get partition key from tombstone warnings?

2015-01-22 Thread Paulo Ricardo Motta Gomes
Yep, you may register and log into the Apache JIRA and click "Vote for this
issue", in the upper right side of the ticket.

On Wed, Jan 21, 2015 at 11:30 PM, Ian Rose ianr...@fullstory.com wrote:

 Ah, thanks for the pointer Philip.  Is there any kind of formal way to
 vote up issues?  I'm assuming that adding a comment of +1 or the like
 is more likely to be *counter*productive.

 - Ian


 On Wed, Jan 21, 2015 at 5:02 PM, Philip Thompson 
 philip.thomp...@datastax.com wrote:

 There is an open ticket for this improvement at
 https://issues.apache.org/jira/browse/CASSANDRA-8561

 On Wed, Jan 21, 2015 at 4:55 PM, Ian Rose ianr...@fullstory.com wrote:

 When I see a warning like "Read 9 live and 5769 tombstoned cells in ..."
 etc., is there a way for me to see the partition key that this query was
 operating on?

 The description in the original JIRA ticket (
 https://issues.apache.org/jira/browse/CASSANDRA-6042) reads as though
 exposing this information was one of the original goals, but it isn't
 obvious to me in the logs...

 Cheers!
 - Ian






-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Reload/resync system.peers table

2014-12-17 Thread Paulo Ricardo Motta Gomes
Hello,

Due to CASSANDRA-6053 there are lots of ghost nodes in the system.peers
table, because decommissioned nodes were not properly removed from this
table.

Is there any automatic way of reloading/resyncing the system.peers table?
Or the only way is by removing ghost nodes?

I tried to restart the node with -Dcassandra.load_ring_state=false, but it
didn't work.
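
If there's no automatic way, the manual cleanup I have in mind is deleting
each ghost entry by hand on every node, a sketch assuming cqlsh and a
placeholder IP:

DELETE FROM system.peers WHERE peer = '10.0.0.1';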

Cheers,

Paulo

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Nodes get stuck in crazy GC loop after some time, leading to timeouts

2014-12-03 Thread Paulo Ricardo Motta Gomes
Thanks a lot for the help Graham and Robert! Will try increasing heap and
see how it goes.

Here are my gc settings, if they're still helpful (they're mostly the
defaults):

-Xms6G -Xmx6G -Xmn400M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB
-XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways
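
(Most likely we'll just bump the first line to something like -Xms8G -Xmx8G
-Xmn800M and keep the CMS flags as they are; the exact numbers are still to
be tested, of course.)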

On Wed, Dec 3, 2014 at 2:17 AM, Jason Wee peich...@gmail.com wrote:

 ack and many thanks for the tips and help..

 jason

 On Wed, Dec 3, 2014 at 4:49 AM, Robert Coli rc...@eventbrite.com wrote:

 On Mon, Dec 1, 2014 at 11:07 PM, Jason Wee peich...@gmail.com wrote:

 Hi Rob, any recommended documentation describing the
 explanation/configuration of the JVM heap and permanent generation? We
 are stuck in this same situation too. :(


 The archives of this list are chock full of explorations of various
 cases. Your best bet is to look for a good Aaron Morton reference where he
 breaks down the math between generations.

 I swear there was a blog post of his on this subject, but the best I can
 find is this slidedeck:

 http://www.slideshare.net/aaronmorton/cassandra-tk-2014-large-nodes

 =Rob





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Nodes get stuck in crazy GC loop after some time, leading to timeouts

2014-11-28 Thread Paulo Ricardo Motta Gomes
Hello,

This is a recurrent behavior of JVM GC in Cassandra that I never completely
understood: when a node is UP for many days (or even months), or receives a
very high load spike (3x-5x normal load), CMS GC pauses start becoming very
frequent and slow, causing periodic timeouts in Cassandra. Trying to run GC
manually doesn't free up memory. The only solution when a node reaches this
state is to restart the node.

We restart the whole cluster every 1 or 2 months, to avoid machines getting
into this crazy state. We tried tuning GC size and parameters, different
cassandra versions (1.1, 1.2, 2.0), but this behavior keeps happening. More
recently, during Black Friday, we received about 5x our normal load, and
some machines started presenting this behavior. Once again, we restarted the
nodes and the GC behaved normally again.

I'm attaching a few pictures comparing the heap of healthy and sick
nodes: http://imgur.com/a/Tcr3w

You can clearly notice some memory is actually reclaimed during GC in
healthy nodes, while in sick machines very little memory is reclaimed.
Also, since GC is executed more frequently in sick machines, it uses about
2x more CPU than non-sick nodes.
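
For anyone wanting to watch this live on a node, a minimal sketch, assuming
the JDK's jstat and a hypothetical PID file path (the charts linked above
come from our regular monitoring):

jstat -gcutil $(cat /var/run/cassandra/cassandra.pid) 5000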

Have you ever observed this behavior in your cluster? Could this be related
to heap fragmentation? Would using the G1 collector help in this case? Any
GC tuning or monitoring advice to troubleshoot this issue?

Any advice or pointers will be kindly appreciated.

Cheers,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Cassandra COPY to CSV and DateTieredCompactionStrategy

2014-11-27 Thread Paulo Ricardo Motta Gomes
Regarding the first question: you need to configure your application to
write to both CFs (old and new) during the migration phase.

I'm not sure about the second question, but my guess is that only the
writeTime will be taken into account.

On Thu, Nov 27, 2014 at 10:54 AM, Batranut Bogdan batra...@yahoo.com
wrote:

 Hello all,

 I have a few things that I need to understand.

 1. Here is the scenario:
 we have a HUGE CF with daily writes; it is like a time series.
 Now we want to change the type of a column in the primary key. What I think
 we can do is to export to CSV, create the new table and write back the
 transformed data. But here is the catch... the constant writes to the CF. I
 assume that by the time the export finishes, new data will be inserted into
 the source CF. So is there a tool that will export data without having to
 stop the writes?

 2. I have seen that there is a new compaction strategy, DTCS, that will
 better fit historical data. Will this compaction strategy take into account
 the writeTime() of an entry, or will it be smart enough to detect that the
 column family is a time series and take those timestamps into account when
 creating the time windows? I am asking this since when we write to the CF,
 the time for a particular record is 00:00h of a given day, so basically all
 entries have the same timestamp value in the CF but of course different
 writeTime().




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Repair/Compaction Completion Confirmation

2014-11-21 Thread Paulo Ricardo Motta Gomes
Hey guys,

Just reviving this thread. In case anyone is using the cassandra_range_repair
tool (https://github.com/BrianGallew/cassandra_range_repair), please sync
your repositories because the tool was not working before due to a critical
bug in the token range definition method. For more information on the bug
please check here:
https://github.com/BrianGallew/cassandra_range_repair/pull/18

Cheers,

On Tue, Oct 28, 2014 at 7:53 AM, Colin co...@clark.ws wrote:

 When I use virtual nodes, I typically use a much smaller number - usually
 in the range of 10. This gives me the ability to add nodes more easily
 without the performance hit.



 --
 *Colin Clark*
 +1-320-221-9531


 On Oct 28, 2014, at 10:46 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 I have been trying this yesterday too.

 https://github.com/BrianGallew/cassandra_range_repair

 "Not 100% bullet proof" -- indeed, I found that operations are done
 multiple times, so it is not very optimised. Though it is open sourced, so
 I guess you can improve things as much as you want and contribute. Here is
 the issue I raised yesterday:
 https://github.com/BrianGallew/cassandra_range_repair/issues/14.

 I am also trying to improve our repair automation since we now have
 multiple DCs and up to 800 GB per node. Repairs are quite heavy right now.

 Good luck,

 Alain

 2014-10-28 4:59 GMT+01:00 Ben Bromhead b...@instaclustr.com:

 https://github.com/BrianGallew/cassandra_range_repair

 This breaks down the repair operation into very small portions of the
 ring as a way to try and work around the current fragile nature of repair.

 Leveraging range repair should go some way towards automating repair
 (this is how the automatic repair service in DataStax opscenter works, this
 is how we perform repairs).

 We have had a lot of success running repairs in a similar manner against
 vnode enabled clusters. Not 100% bullet proof, but way better than nodetool
 repair



 On 28 October 2014 08:32, Tim Heckman t...@pagerduty.com wrote:

 On Mon, Oct 27, 2014 at 1:44 PM, Robert Coli rc...@eventbrite.com
 wrote:

 On Mon, Oct 27, 2014 at 1:33 PM, Tim Heckman t...@pagerduty.com wrote:

 I know that when issuing some operations via nodetool, the command
 blocks until the operation is finished. However, is there a way to 
 reliably
 determine whether or not the operation has finished without monitoring 
 that
 invocation of nodetool?

 In other words, when I run 'nodetool repair' what is the best way to
 reliably determine that the repair is finished without running something
 equivalent to a 'pgrep' against the command I invoked? I am curious about
 trying to do the same for major compactions too.


 This is beyond a FAQ at this point, unfortunately; non-incremental
 repair is awkward to deal with and probably impossible to automate.

 In The Future [1] the correct solution will be to use incremental
 repair, which mitigates but does not solve this challenge entirely.

 As brief meta commentary, it would have been nice if the project had
 spent more time optimizing the operability of the critically important
 thing you must do once a week [2].

 https://issues.apache.org/jira/browse/CASSANDRA-5483

 =Rob
 [1] http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1
 [2] Or, more sensibly, once a month with gc_grace_seconds set to 34
 days.


 Thank you for getting back to me so quickly. Not the answer that I was
 secretly hoping for, but it is nice to have confirmation. :)

 Cheers!
 -Tim




 --

 Ben Bromhead

 Instaclustr | www.instaclustr.com | @instaclustr
 http://twitter.com/instaclustr | +61 415 936 359





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Increase in dropped mutations after major upgrade from 1.2.18 to 2.0.10

2014-11-10 Thread Paulo Ricardo Motta Gomes
Hey,

We've seen a considerable increase in the number of dropped mutations after
a major upgrade from 1.2.18 to 2.0.10. I initially thought it was due to
the extra load incurred by upgradesstables, but the dropped mutations
continue even after all sstables are upgraded.

Additional info: Overall (read, write and range) latency improved with the
upgrade, which is great, but I don't understand why dropped mutations have
increased. I/O and CPU load is pretty much the same; the number of completed
tasks is the only metric that increased together with dropped mutations.

I also noticed that the number of all time blocked FlushWriter operations
is about 5% of completed operations, don't know if this is related, but in
case it helps out...
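
For reference, these counts come from JMX; roughly the same view is
available via nodetool, e.g.:

nodetool tpstats | grep -E 'FlushWriter|MUTATION'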

Does anyone have a clue what this could be? Or what should we monitor to find
out? Any help or JIRA pointers would be kindly appreciated.

Cheers,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Increase in dropped mutations after major upgrade from 1.2.18 to 2.0.10

2014-11-10 Thread Paulo Ricardo Motta Gomes
On Mon, Nov 10, 2014 at 12:46 PM, Duncan Sands duncan.sa...@gmail.com
wrote:

 Hi Paulo,

 On 10/11/14 15:18, Paulo Ricardo Motta Gomes wrote:

 Hey,

 We've seen a considerable increase in the number of dropped mutations
 after a major upgrade from 1.2.18 to 2.0.10. I initially thought it was
 due to the extra load incurred by upgradesstables, but the dropped
 mutations continue even after all sstables are upgraded.


 are the clocks on all your nodes synchronized with each other?

 Ciao, Duncan.


Yes, the servers are synchronized via NTP.

Cheers!




 Additional info: Overall (read, write and range) latency improved with the
 upgrade, which is great, but I don't understand why dropped mutations have
 increased. I/O and CPU load is pretty much the same; the number of
 completed tasks is the only metric that increased together with dropped
 mutations.

 I also noticed that the number of all-time-blocked FlushWriter operations
 is about 5% of completed operations, don't know if this is related, but in
 case it helps out...

 Does anyone have a clue what this could be? Or what should we monitor to
 find out? Any help or JIRA pointers would be kindly appreciated.

 Cheers,

 --
 *Paulo Motta*

 Chaordic | /Platform/
 _www.chaordic.com.br http://www.chaordic.com.br/_
 +55 48 3232.3200





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: efficiently generate complete database dump in text format

2014-10-09 Thread Paulo Ricardo Motta Gomes
The best way to generate dumps from Cassandra is via the Hadoop integration
(or Spark). You can find more info here:

http://www.datastax.com/documentation/cassandra/2.1/cassandra/configuration/configHadoop.html
http://wiki.apache.org/cassandra/HadoopSupport

On Thu, Oct 9, 2014 at 4:19 AM, Gaurav Bhatnagar gbhatna...@gmail.com
wrote:

 Hi,
 We have a Cassandra database column family containing 320 million rows,
 and each row contains about 15 columns. We want to take a monthly dump of
 this single column family in text format.

 We are planning to take the following approach to implement this
 functionality:
 1. Take a snapshot of the Cassandra database using the nodetool utility.
 We specify the -cf flag with the column family name so that the snapshot
 contains data corresponding to a single column family.
 2. We take a backup of this snapshot and move it to a separate physical
 machine.
 3. We use the SSTable-to-JSON conversion utility to convert all the data
 files into JSON format.
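
 (For reference, a sketch of the commands for steps 1 and 3; the
 keyspace/CF names and the sstable path are placeholders:

 nodetool snapshot my_keyspace -cf my_cf -t monthly_dump
 sstable2json /path/to/snapshots/monthly_dump/my_cf-Data.db > my_cf.json)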

 We have the following questions/doubts regarding the above approach:
 a) Generated JSON records contain a "d" (IS_MARKED_FOR_DELETE) flag; can I
 safely ignore all such JSON records?
 b) If I ignore all records marked by the "d" flag, can the JSON files
 generated in step 3 contain duplicate records? I mean, multiple entries
 for the same key.

 Is there any other, better approach to generating data dumps in text
 format?

 Regards,
 Gaurav




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Multi-DC Repairs and Token Questions

2014-10-07 Thread Paulo Ricardo Motta Gomes
This related issue might be of interest:
https://issues.apache.org/jira/browse/CASSANDRA-7450

In 1.2 the -pr option does make cross-DC repairs, but you must ensure that
all nodes from all datacenters execute repair, otherwise some ranges will be
missed. The fix enables -pr and -local together, which was disabled in
2.0 because it didn't work (it also does not work in 1.2).
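
Once that fix lands, something like this sketch becomes meaningful to run on
every node of the local DC (the keyspace name is a placeholder):

nodetool repair -pr -local my_keyspace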

On Tue, Oct 7, 2014 at 5:46 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 Hi guys, sorry about digging this up, but is this bug also affecting 1.2.x
 versions? I can't see it being backported to 1.2 in the Jira. Was this bug
 introduced in 2.0?

 Anyway, how does nodetool repair -pr behave in a multi-DC environment:
 does it make cross-DC repairs or not? Should we remove the -pr option in a
 multi-DC context to remove entropy between DCs? I mean, a repair -pr is
 supposed to repair the primary range of the current node; does it also
 repair the corresponding primary range in other DCs?

 Thanks for insight around this.

 2014-06-03 8:06 GMT+02:00 Nick Bailey n...@datastax.com:

 See https://issues.apache.org/jira/browse/CASSANDRA-7317


 On Mon, Jun 2, 2014 at 8:57 PM, Matthew Allen matthew.j.al...@gmail.com
 wrote:

 Hi Rameez, Chovatia (sorry, I initially replied to Dwight individually),

 SN_KEYSPACE and MY_KEYSPACE are just typos (I was trying to mask out
 identifiable information); they are the same keyspace.

 Keyspace: SN_KEYSPACE:
   Replication Strategy:
 org.apache.cassandra.locator.NetworkTopologyStrategy
   Durable Writes: true
 Options: [DC_VIC:2, DC_NSW:2]

 In a nutshell, replication is working as expected; I'm just confused
 about token range assignments in a multi-DC environment and how repairs
 should work.

 From
 http://www.datastax.com/documentation/cassandra/1.2/cassandra/configuration/configGenTokens_c.html,
 it specifies

 *Multiple data center deployments: calculate the tokens for
 each data center so that the hash range is evenly divided for the nodes in
 each data center*

 Given that nodetool repair isn't multi-DC aware, in our production 18-node
 cluster (9 nodes in each DC), which of the following token ranges should
 be used (Murmur3Partitioner)?

 Token range divided evenly over the 2 DCs/18 nodes as below?

 Node  DC_NSW                  DC_VIC
 1     '-9223372036854775808'  '-8198552921648689608'
 2     '-7173733806442603408'  '-6148914691236517208'
 3     '-5124095576030431008'  '-4099276460824344808'
 4     '-3074457345618258608'  '-2049638230412172408'
 5     '-1024819115206086208'  '-8'
 6     '1024819115206086192'   '2049638230412172392'
 7     '3074457345618258592'   '4099276460824344792'
 8     '5124095576030430992'   '6148914691236517192'
 9     '7173733806442603392'   '8198552921648689592'

 Or an offset used for DC_VIC (i.e. DC_NSW + 100)?

 Node  DC_NSW                  DC_VIC
 1     '-9223372036854775808'  '-9223372036854775708'
 2     '-7173733806442603407'  '-7173733806442603307'
 3     '-5124095576030431006'  '-5124095576030430906'
 4     '-3074457345618258605'  '-3074457345618258505'
 5     '-1024819115206086204'  '-1024819115206086104'
 6     '1024819115206086197'   '1024819115206086297'
 7     '3074457345618258598'   '3074457345618258698'
 8     '5124095576030430999'   '5124095576030431099'
 9     '7173733806442603400'   '7173733806442603500'
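
 (For reference, both layouts follow the standard formula for N evenly
 spaced Murmur3 tokens, token(i) = -2^63 + i * 2^64 / N: the first uses
 N = 18 across both DCs with the nodes interleaved, the second uses N = 9
 within each DC and then adds 100 to every DC_VIC token.)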

 It's too late for me to switch to vnodes, hope that makes sense, thanks

 Matt



 On Thu, May 29, 2014 at 12:01 AM, Rameez Thonnakkal ssram...@gmail.com
 wrote:

 As Chovatia mentioned, the keyspaces seem to be different.
 Try DESCRIBE KEYSPACE SN_KEYSPACE and DESCRIBE KEYSPACE MY_KEYSPACE
 from CQL.
 This will give you an idea of how many replicas there are for these
 keyspaces.



 On Wed, May 28, 2014 at 11:49 AM, chovatia jaydeep 
 chovatia_jayd...@yahoo.co.in wrote:

 What is your partitioner type? Is
 it org.apache.cassandra.dht.Murmur3Partitioner?
 In your repair command I do see two different keyspaces, MY_KEYSPACE
 and SN_KEYSPACE; are these two separate keyspaces or a typo?

 -jaydeep


   On Tuesday, 27 May 2014 10:26 PM, Matthew Allen 
 matthew.j.al...@gmail.com wrote:


 Hi,

 Am a bit confused regarding data ownership in a multi-dc environment.

 I have the following setup in a test cluster with a keyspace with
 (placement_strategy = 'NetworkTopologyStrategy' and strategy_options =
 {'DC_NSW':2,'DC_VIC':2};)

 Datacenter: DC_NSW
 ==
 Replicas: 2
 Address  Rack   Status  State   Load        Owns     Token
                                                      0
 nsw1     rack1  Up      Normal  1007.43 MB  100.00%  -9223372036854775808
 nsw2     rack1  Up      Normal  1008.08 MB  100.00%  0


 Datacenter: DC_VIC
 ==
 Replicas: 2
 Address  Rack   Status  State   Load        Owns     Token
                                                      100
 vic1     rack1  Up      Normal  1015.1 MB   100.00%  -9223372036854775708
 vic2     rack1  Up      Normal  1015.13 MB  100.00%  100

 My understanding is that both 

backport of CASSANDRA-6916

2014-09-16 Thread Paulo Ricardo Motta Gomes
Hello,

Has anyone backported incremental replacement of compacted SSTables
(CASSANDRA-6916) to 2.0? Is it doable, or are there many dependencies
introduced in 2.1?

Haven't checked the ticket detail yet, but just in case anyone has
interesting info to share.

Cheers,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: backport of CASSANDRA-6916

2014-09-16 Thread Paulo Ricardo Motta Gomes
For my own purposes, but I wouldn't mind making it public so people could
patch it themselves if they want to... (if nobody has already done so) :)

On Tue, Sep 16, 2014 at 8:13 PM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Sep 16, 2014 at 2:56 PM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 Has anyone backported incremental replacement of compacted SSTables
 (CASSANDRA-6916) to 2.0? Is it doable, or are there many dependencies
 introduced in 2.1?

 Haven't checked the ticket detail yet, but just in case anyone has
 interesting info to share.


 Are you looking to patch for public consumption, or for your own purposes?

 I just took the temperature of #cassandra-dev and they were cold on the
 idea as a public patch, because of potential impact on stability.

 =Rob





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: backport of CASSANDRA-6916

2014-09-16 Thread Paulo Ricardo Motta Gomes
Because I want this specific feature, and not all 2.1 features, even though
this is probably one of the most significant changes in 2.1. Upgrading would
be nice, but we want to wait a little more before fully jumping into 2.1 :)

We're having sudden peaks in read latency some time after a massive batch
write, which is most likely caused by the cold page cache of newly compacted
sstables, and will hopefully be solved by this.

On Tue, Sep 16, 2014 at 8:25 PM, James Briggs james.bri...@yahoo.com
wrote:

 Paulo:

 Out of curiosity, why not just upgrade to 2.1 if you want the new features?

 You know you want to! :)


 Thanks, James Briggs
 --
 Cassandra/MySQL DBA. Available in San Jose area or remote.


   --
  *From:* Robert Coli rc...@eventbrite.com
 *To:* user@cassandra.apache.org user@cassandra.apache.org
 *Sent:* Tuesday, September 16, 2014 4:13 PM
 *Subject:* Re: backport of CASSANDRA-6916

 On Tue, Sep 16, 2014 at 2:56 PM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 Has anyone backported incremental replacement of compacted SSTables
 (CASSANDRA-6916) to 2.0? Is it doable, or are there many dependencies
 introduced in 2.1?

 Haven't checked the ticket detail yet, but just in case anyone has
 interesting info to share.


 Are you looking to patch for public consumption, or for your own purposes?

 I just took the temperature of #cassandra-dev and they were cold on the
 idea as a public patch, because of potential impact on stability.

 =Rob






-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Quickly loading C* dataset into memory (row cache)

2014-09-14 Thread Paulo Ricardo Motta Gomes
Apparently Apple is using Cassandra as a massive multi-DC cache, as per
their announcement during the summit, but probably DSE with the in-memory
option enabled. Would love to hear about similar use cases.

On Fri, Sep 12, 2014 at 12:20 PM, Ken Hancock ken.hanc...@schange.com
wrote:

 +1 for Redis.

 It's really nice, with good primitives, and then you can do some really
 cool stuff chaining multiple atomic operations to create larger atomics
 through the Lua scripting.

 On Thu, Sep 11, 2014 at 12:26 PM, Robert Coli rc...@eventbrite.com
 wrote:

 On Thu, Sep 11, 2014 at 8:30 AM, Danny Chan tofuda...@gmail.com wrote:

 What are you referring to when you say memory store?

 RAM disk? memcached?


 In 2014, probably Redis?

 =Rob





 --
 *Ken Hancock *| System Architect, Advanced Advertising
 SeaChange International
 50 Nagog Park
 Acton, Massachusetts 01720
 ken.hanc...@schange.com | www.schange.com | NASDAQ:SEAC
 http://www.schange.com/en-US/Company/InvestorRelations.aspx
 Office: +1 (978) 889-3329 | Google Talk: ken.hanc...@schange.com
  | Skype: hancockks | Yahoo IM: hancockks | LinkedIn:
 http://www.linkedin.com/in/kenhancock

 SeaChange International
 http://www.schange.com/ This e-mail and any attachments may contain
 information which is SeaChange International confidential. The information
 enclosed is intended only for the addressees herein and may not be copied
 or forwarded without permission from SeaChange International.




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Too many SSTables after rebalancing cluster (LCS)

2014-08-29 Thread Paulo Ricardo Motta Gomes
Deleting the JSON manifest worked like a charm. After 2 days of compactions
I've got 50GB of extra space! :)

Just a quick addendum: after deleting the JSON metadata file, I needed to
restart the node, otherwise it just recreates the file from memory.

Version: 1.2.16
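
Roughly what I ran, for reference; the keyspace/CF names are placeholders
and the path assumes the default data directory:

rm /var/lib/cassandra/data/my_keyspace/my_cf/my_cf.json
# then restart the node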

On Wed, Aug 27, 2014 at 8:13 PM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Aug 27, 2014 at 3:27 PM, Nate McCall n...@thelastpickle.com
 wrote:

 Another option to force things - deleting the json metadata file for that
 table will cause LCS to put all SSTables in level 0 and begin recompacting
 them.


 That's possible in versions where the level is in a JSON file, which is
 versions before 2.0. In 2.0+ you can use nodetool for the same purpose.

 https://issues.apache.org/jira/browse/CASSANDRA-5271 (Fixed; 2.0 beta 1):
 Create tool to drop sstables to level 0

 =Rob




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Too many SSTables after rebalancing cluster (LCS)

2014-08-27 Thread Paulo Ricardo Motta Gomes
Great idea, will try that (right now it is 10%, but being more aggressive
should hopefully work).

Cheers!


On Wed, Aug 27, 2014 at 7:02 PM, Nate McCall n...@thelastpickle.com wrote:

 Try turning down 'tombstone_threshold' to something like '0.05' from its
 default of '0.2'. This will cause an SSTable to be considered for
 tombstone-only compactions more frequently (if 5% of the columns are
 tombstones instead of 20%).

 For a bit more info, see:

 http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/compactSubprop.html
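
 In cqlsh that would look something like this sketch (the compaction map
 replaces the sub-properties wholesale, so the class has to be restated;
 keyspace/table names are placeholders):

 ALTER TABLE my_keyspace.my_cf
   WITH compaction = {'class': 'LeveledCompactionStrategy',
                      'tombstone_threshold': '0.05'};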


 On Tue, Aug 26, 2014 at 1:38 PM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 Hey folks,

 After adding more nodes and moving the tokens of old nodes to rebalance
 the ring, I noticed that the old nodes had significantly more data than
 the newly bootstrapped nodes, even after cleanup.

 I noticed that the old nodes had a much larger number of SSTables on LCS
 CFs, and most of them located on the last level:

 Node N-1 (old node): [1, 10, 102/100, 173, 2403, 0, 0, 0, 0] (total: 2695)
 Node N (new node):   [1, 10, 108/100, 214, 0, 0, 0, 0, 0] (total: 339)
 Node N+1 (old node): [1, 10, 87, 113, 1076, 0, 0, 0, 0] (total: 1287)

 Since these sstables have a lot of tombstones, and they're not updated
 frequently, they remain in the last level forever, and are never cleaned.

 What is the solution here? The good old change to STCS and then back to
 LCS, or is there something less brute force?

 Environment: Cassandra 1.2.16 - non-vnodes

 Any help would be very much appreciated.

 Cheers,

 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200




 --
 -
 Nate McCall
 Austin, TX
 @zznate

 Co-Founder  Sr. Technical Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Too many SSTables after rebalancing cluster (LCS)

2014-08-26 Thread Paulo Ricardo Motta Gomes
Hey folks,

After adding more nodes and moving the tokens of old nodes to rebalance the
ring, I noticed that the old nodes had significantly more data than the
newly bootstrapped nodes, even after cleanup.

I noticed that the old nodes had a much larger number of SSTables on LCS
CFs, and most of them located on the last level:

Node N-1 (old node): [1, 10, 102/100, 173, 2403, 0, 0, 0, 0] (total: 2695)
Node N (new node):   [1, 10, 108/100, 214, 0, 0, 0, 0, 0] (total: 339)
Node N+1 (old node): [1, 10, 87, 113, 1076, 0, 0, 0, 0] (total: 1287)

Since these sstables have a lot of tombstones, and they're not updated
frequently, they remain in the last level forever, and are never cleaned.

What is the solution here? The good old change to STCS and then back to
LCS, or is there something less brute force?

Environment: Cassandra 1.2.16 - non-vnodes

Any help would be very much appreciated.

Cheers,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: EC2 SSD cluster costs

2014-08-19 Thread Paulo Ricardo Motta Gomes
Still using good ol' m1.xlarge here + external caching (memcached). We're
trying to adapt our use case to have different clusters for different use
cases, so we can leverage SSDs at an acceptable cost in some of them.


On Tue, Aug 19, 2014 at 1:05 PM, Shane Hansen shanemhan...@gmail.com
wrote:

 Again, it depends on your use case.
 But we wanted to keep the data per node below 500GB,
 and we found RAIDed SSDs to be the best bang for the buck
 for our cluster. I think we moved from the i2 to the c3 because
 our bottleneck tended to be CPU utilization (from parsing requests).



 (Disclaimer: we're not Cassandra veterans, but we're not part of the
 RF=N=3 club)



 On Tue, Aug 19, 2014 at 10:00 AM, Russell Bradberry rbradbe...@gmail.com
 wrote:

 Short answer, it depends on your use-case.

 We migrated to i2.xlarge nodes and saw an immediate increase in
 performance.  If you just need plain ole raw disk space and don’t have a
 performance requirement to meet then the m1 machines would work, or hell
 even SSD EBS volumes may work for you.  The problem we were having is that
 we couldn’t fill the m1 machines because we needed to add more nodes for
 performance.  Now we have much more power and just the right amount of disk
 space.

 Basically saying, these are not apples-to-apples comparisons



 On August 19, 2014 at 11:57:10 AM, Jeremy Jongsma (jer...@barchart.com)
 wrote:

 The latest consensus around the web for running Cassandra on EC2 seems to
 be "use new SSD instances." I've not seen any mention of the elephant in
 the room - using the new SSD instances significantly raises the cluster
 cost per TB. With Cassandra's strength being linear scalability to many
 terabytes of data, it strikes me as odd that everyone is recommending such
 a large storage cost hike almost without reservation.

 Monthly cost comparison for a 100TB cluster (non-reserved instances):

 m1.xlarge (2x420 non-SSD): $30,000 (120 nodes)
 m3.xlarge (2x40 SSD): $250,000 (1250 nodes! Clearly not an option)
 i2.xlarge (1x800 SSD): $76,000 (125 nodes)
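
 (Sanity-checking the arithmetic, assuming approximate on-demand rates of
 ~$0.35/hr for m1.xlarge and ~$0.85/hr for i2.xlarge: 120 nodes x ~$250/mo
 ≈ $30,000 and 125 nodes x ~$610/mo ≈ $76,000, so the figures line up.)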

 Best case, the cost goes up 150%. How are others approaching these new
 instances? Have you migrated and eaten the costs, or are you staying on
 previous generation until prices come down?





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: How to maintain the N-most-recent versions of a value?

2014-07-18 Thread Paulo Ricardo Motta Gomes
You might be interested in the following ticket:
https://issues.apache.org/jira/browse/CASSANDRA-3929

There's a patch available that was not integrated because it's not possible
to guarantee exactly N values will be kept, and there are some other
problems with deletions, but it may be useful depending on your usage
characteristics.


On Fri, Jul 18, 2014 at 7:58 AM, Laing, Michael michael.la...@nytimes.com
wrote:

 The CQL you provided is invalid. You probably meant something like:

 CREATE TABLE foo (
     rowkey text,
     family text,
     qualifier text,
     version int,
     value blob,
     PRIMARY KEY ((rowkey, family, qualifier), version))
 WITH CLUSTERING ORDER BY (version DESC);


 We use TTLs and LIMIT for structures like these, paying attention to the
 construction of the partition key so that partition sizes are reasonable.
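
 As a sketch, with a hypothetical TTL and N:

 INSERT INTO foo (rowkey, family, qualifier, version, value)
 VALUES ('r1', 'f1', 'q1', 3, 0xff) USING TTL 86400;

 SELECT * FROM foo WHERE rowkey='r1' AND family='f1' AND qualifier='q1'
 LIMIT 10;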

 If the blob might be large, store it somewhere else. We use S3 but you
 could also put it in another C* table.

 In 2.1 the row cache may help as it will store N rows per recently
 accessed partition, starting at the beginning of the partition.

 ml


 On Fri, Jul 18, 2014 at 6:30 AM, Benedict Elliott Smith 
 belliottsm...@datastax.com wrote:

 If the versions can be guaranteed to be adjacent (i.e. if the latest
 version is V, the prior version is V-1) you could issue a delete at the
 same time as an insert for V-N-(buffer) where buffer >= 0

 In general guaranteeing that is probably hard, so this seems like
 something that would be nice to have C* manage for you. Unfortunately we
 don't have anything on the roadmap to help with this. A custom compaction
 strategy might do the trick, or permitting some filter during compaction
 that can omit/tombstone certain records based on the input data. This
 latter option probably wouldn't be too hard to implement, although it might
 not offer any guarantees about expiring records in order without incurring
 extra compaction cost (you could reasonably easily guarantee the most
 recent N are present, but the cleaning up of older records might happen
 haphazardly, in no particular order, and without any promptness guarantees,
 if you want to do it cheaply). Feel free to file a ticket, or submit a
 patch!


 On Fri, Jul 18, 2014 at 1:32 AM, Clint Kelly clint.ke...@gmail.com
 wrote:

 Hi everyone,

 I am trying to design a schema that will keep the N-most-recent
 versions of a value.  Currently my table looks like the following:

 CREATE TABLE foo (
 rowkey text,
 family text,
 qualifier text,
 version long,
 value blob,
 PRIMARY KEY (rowkey, family, qualifier, version))
 WITH CLUSTER ORDER BY (rowkey ASC, family ASC, qualifier ASC, version
 DESC));

 Is there any standard design pattern for updating such a layout such
 that I keep the N-most-recent (version, value) pairs for every unique
 (rowkey, family, qualifier)?  I can't think of any way to do this
 without doing a read-modify-write.  The best thing I can think of is
 to use TTL to approximate the desired behavior (which will work if I
 know how often we are writing new data to the table).  I could also
 use LIMIT N in my queries to limit myself to only N items, but that
 does not address any of the storage-size issues.

 In case anyone is curious, this question is related to some work that
 I am doing translating a system built on HBase (which provides this
 "keep the N most recent versions of a cell" behavior) to Cassandra
 while providing the user with as-similar-as-possible an interface.

 Best regards,
 Clint






-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: unable to find sufficient sources for streaming range

2014-07-02 Thread Paulo Ricardo Motta Gomes
Are you using the -Dcassandra.replace_address=address_of_dead_node flag
to replace the removed node, according to
http://www.datastax.com/documentation/cassandra/1.2/cassandra/operations/ops_replace_node_t.html
?

If yes, and the new node has the same address as the replaced node, you
might be hitting CASSANDRA-6622
(https://issues.apache.org/jira/browse/CASSANDRA-6622), which was fixed only
in 1.2.16.
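
For reference, a sketch of how the flag is usually passed, e.g. appended to
cassandra-env.sh before starting the replacement node (the address is a
placeholder):

JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.0.1"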

Cheers,


On Wed, Jul 2, 2014 at 8:14 PM, Daning Wang dan...@netseer.com wrote:

 We are running Cassandra 1.2.5

 We have an 8-node cluster, and we removed one machine from the cluster and
 tried to add it back (the purpose is, since we are using vnodes, some nodes
 have more tokens, so by rejoining this machine we hoped it could take some
 load from the busy machines). But we got the following exception and the
 node cannot be added to the ring anymore.

 Please help,

 Thanks in advance,


  INFO 16:01:56,260 JOINING: Starting to bootstrap...
 ERROR 16:01:56,514 Exception encountered during startup
 java.lang.IllegalStateException: unable to find sufficient sources for
 streaming range
 (131921530760098415548184818173535242096,132123583169200197961735373586277861750]
 at
 org.apache.cassandra.dht.RangeStreamer.getRangeFetchMap(RangeStreamer.java:205)
 at
 org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:129)
 at
 org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:81)
 at
 org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:924)
 at
 org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:693)
 at
 org.apache.cassandra.service.StorageService.initServer(StorageService.java:548)
 at
 org.apache.cassandra.service.StorageService.initServer(StorageService.java:445)
 at
 org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:325)
 at
 org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:413)
 at
 org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:456)
 java.lang.IllegalStateException: unable to find sufficient sources for
 streaming range
 (131921530760098415548184818173535242096,132123583169200197961735373586277861750]
 at
 org.apache.cassandra.dht.RangeStreamer.getRangeFetchMap(RangeStreamer.java:205)
 at
 org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:129)
 at
 org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:81)
 at
 org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:924)
 at
 org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:693)
 at
 org.apache.cassandra.service.StorageService.initServer(StorageService.java:548)
 at
 org.apache.cassandra.service.StorageService.initServer(StorageService.java:445)
 at
 org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:325)
 at
 org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:413)
 at
 org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:456)
 Exception encountered during startup: unable to find sufficient sources
 for streaming range
 (131921530760098415548184818173535242096,132123583169200197961735373586277861750]
 ERROR 16:01:56,518 Exception in thread
 Thread[StorageServiceShutdownHook,5,main]
 java.lang.NullPointerException
 at
 org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
 at
 org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:362)
 at
 org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
 at
 org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:513)


 Daning




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: nodetool repair -snapshot option?

2014-06-30 Thread Paulo Ricardo Motta Gomes
If you find it useful, I created a tool where you input the node IP,
keyspace, column family, and optionally the number of partitions (default:
32K), and it outputs the list of subranges for that node, CF, and partition
count: https://github.com/pauloricardomg/cassandra-list-subranges

So you can basically iterate over the output of that and do subrange repair
for each node and cf, maybe in parallel. :)
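
A minimal sketch of that iteration, assuming the tool prints one start:end
token pair per line (check the README for the actual output format; the
keyspace/CF names and the IP are placeholders):

./list-subranges 10.0.0.1 my_keyspace my_cf | while IFS=: read -r start end; do
  nodetool -h 10.0.0.1 repair my_keyspace my_cf -st "$start" -et "$end"
done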


On Mon, Jun 30, 2014 at 10:26 PM, Phil Burress philburress...@gmail.com
wrote:

 One last question. Any tips on scripting a subrange repair?


 On Mon, Jun 30, 2014 at 7:12 PM, Phil Burress philburress...@gmail.com
 wrote:

 We are running repair -pr. We've tried subrange manually and that seems
 to work ok. I guess we'll go with that going forward. Thanks for all the
 info!


 On Mon, Jun 30, 2014 at 6:52 PM, Jaydeep Chovatia 
 chovatia.jayd...@gmail.com wrote:

 Are you running full repair or on a subset? If you are running full repair
 then try running on a subset of ranges, which means less data to worry
 about during repair, and that would help the Java heap in general. You
 will have to do multiple iterations to cover the entire range, but at
 least it will work.
 -jaydeep


 On Mon, Jun 30, 2014 at 3:22 PM, Robert Coli rc...@eventbrite.com
 wrote:

 On Mon, Jun 30, 2014 at 3:08 PM, Yuki Morishita mor.y...@gmail.com
 wrote:

 Repair uses snapshot option by default since 2.0.2 (see NEWS.txt).


 As a general meta comment, the process by which operationally important
 defaults change in Cassandra seems ad-hoc and sub-optimal.

 For the record, my view was that this change, which makes repair even
 slower than it previously was, was probably overly optimistic.

 It's also weird in that it changes default behavior which has been
 unchanged since the start of Cassandra time and is therefore probably
 automated against. Why was it so critically important to switch to snapshot
 repair that it needed to be shotgunned as a new default in 2.0.2?

 =Rob








-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


repair takes 10x more time in one DC compared to the other

2014-06-25 Thread Paulo Ricardo Motta Gomes
Hello,

I'm running repair on a large CF with the --local flag in 2 different
DCs. In one of the DCs the operation takes about 1 hour per node, while in
the other it takes 10 hours per node.

I would expect the times to differ, but not so much. The writes on that CF
all come from the DC where it takes 10 hours per node, could this be the
cause why it takes so long on this DC?

Additional info: C* 1.2.16, both DCs have the same replication factor.

Cheers,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: repair takes 10x more time in one DC compared to the other

2014-06-25 Thread Paulo Ricardo Motta Gomes
Thanks for the explanation, but I got slightly confused:

From my understanding, you just described the behavior of the
-pr/--partitioner-range option: Repair only the first range returned by
the partitioner for the node. , so I would understand that repairs in the
same CFs in different DCs with only the -pr option could take different
times.

However according to the description of the -local/--in-local-dc option, it
only repairs against nodes in the same data center, but you said that the
range will be repaired for all replica in all data-centers, even with the
-local option, or did you confuse it with -pr option?

In any case, I'm using both -local and -pr options, what is the
expected behavior in that case?

Cheers,



On Wed, Jun 25, 2014 at 12:46 PM, Sylvain Lebresne sylv...@datastax.com
wrote:

 TL;DR, this is not unexpected and this is perfectly fine.

 For every node, 'repair --local' will repair the primary (where primary
 means the first range on the ring picked by the consistent hashing for
 this node given its token, nothing more) range of the node in the ring.
 And that range will be repaired for all replica in all data-centers. When
 you assign tokens to multiple DC, it's actually pretty common to offset the
 tokens of one DC slightly compared to the other one. This will result in
 the primary ranges being always small in one DC but not the other. But
 please note that this is perfectly ok, it does not imply any imbalance in
 data-centers. It also doesn't really mean that the nodes of one DC actually
 do a lot more work than the other ones: all nodes most likely contribute
 roughly the same amount of work to the repair. It only means that the nodes
 of one DC coordinate more repair work than those of the other DC. Which
 is not really a big deal since coordinating a repair is cheap.
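
 A toy illustration of that token layout (made-up tokens, not from this
 cluster): with one DC's tokens offset by 1, the primary ranges in the
 offset DC collapse to width 1 while the other DC's primary ranges cover
 almost the whole ring.

    import java.util.Map;
    import java.util.TreeMap;

    public class PrimaryRanges {
        public static void main(String[] args) {
            // DC1 owns tokens 0 and 100; DC2 owns the same positions offset by 1.
            TreeMap<Long, String> ring = new TreeMap<Long, String>();
            ring.put(0L, "dc1-node1");
            ring.put(1L, "dc2-node1");
            ring.put(100L, "dc1-node2");
            ring.put(101L, "dc2-node2");

            long prev = ring.lastKey(); // the ring wraps around
            for (Map.Entry<Long, String> e : ring.entrySet()) {
                System.out.printf("%s primary range: (%d, %d]%n",
                        e.getValue(), prev, e.getKey());
                prev = e.getKey();
            }
            // dc2 nodes end up with primary ranges of width 1; dc1 nodes own
            // nearly everything else, so repairs coordinated there take longer.
        }
    }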

 --
 Sylvain


 On Wed, Jun 25, 2014 at 4:43 PM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 Hello,

 I'm running repair on a large CF with the --local flag in 2 different
 DCs. In one of the DCs the operation takes about 1 hour per node, while in
 the other it takes 10 hours per node.

 I would expect the times to differ, but not so much. The writes on that
 CF all come from the DC where it takes 10 hours per node, could this be the
 cause why it takes so long on this DC?

 Additional info: C* 1.2.16, both DCs have the same replication factor.

 Cheers,

 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: repair takes 10x more time in one DC compared to the other

2014-06-25 Thread Paulo Ricardo Motta Gomes
Hmm.. good to find out, thanks for the reference! This explains the time
differences between repairs in different DCs.

But I think using -local and -pr should still be supported simultaneously,
since you may want to repair nodes sequentially in the local DC (-local)
without re-repairing ranges of neighbor nodes (-pr).


On Wed, Jun 25, 2014 at 1:48 PM, Sylvain Lebresne sylv...@datastax.com
wrote:

 I see. Well, you shouldn't use both -local and -pr together, they
 don't make sense together. Which is the reason why their combination will
 be rejected in 2.0.9 (you can check
 https://issues.apache.org/jira/browse/CASSANDRA-7317 for details).
 Basically, the result of using both is that lots of stuff doesn't get
 repaired.


 On Wed, Jun 25, 2014 at 6:11 PM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 Thanks for the explanation, but I got slightly confused:

 From my understanding, you just described the behavior of the
 -pr/--partitioner-range option: Repair only the first range returned by
 the partitioner for the node. , so I would understand that repairs in the
 same CFs in different DCs with only the -pr option could take different
 times.

 However according to the description of the -local/--in-local-dc option,
 it only repairs against nodes in the same data center, but you said that 
 the
 range will be repaired for all replica in all data-centers, even with the
 -local option, or did you confuse it with -pr option?

 In any case, I'm using both -local and -pr options, what is the
 expected behavior in that case?

 Cheers,



 On Wed, Jun 25, 2014 at 12:46 PM, Sylvain Lebresne sylv...@datastax.com
 wrote:

 TL;DR, this is not unexpected and this is perfectly fine.

 For every node, 'repair --local' will repair the primary (where
 primary means the first range on the ring picked by the consistent hashing
 for this node given its token, nothing more) range of the node in the
 ring. And that range will be repaired for all replica in all data-centers.
 When you assign tokens to multiple DC, it's actually pretty common to
 offset the tokens of one DC slightly compared to the other one. This will
 result in the primary ranges being always small in one DC but not the
 other. But please note that this is perfectly ok, it does not imply any
 imbalance in data-centers. It also doesn't really mean that the nodes of one
 DC actually do a lot more work than the other ones: all nodes most likely
 contribute roughly the same amount of work to the repair. It only means that
 the nodes of one DC coordinate more repair work than those of the other
 DC. Which is not really a big deal since coordinating a repair is cheap.

 --
 Sylvain


 On Wed, Jun 25, 2014 at 4:43 PM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 Hello,

 I'm running repair on a large CF with the --local flag in 2 different
 DCs. In one of the DCs the operation takes about 1 hour per node, while in
 the other it takes 10 hours per node.

 I would expect the times to differ, but not so much. The writes on that
 CF all come from the DC where it takes 10 hours per node, could this be the
 cause why it takes so long on this DC?

 Additional info: C* 1.2.16, both DCs have the same replication factor.

 Cheers,

 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200





 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Best practices for repair

2014-06-19 Thread Paulo Ricardo Motta Gomes
Hello Paolo,

I just published an open source version of the dsetool list_subranges
command, which will enable you to perform subrange repair as described in
the post.

You can find the code and usage instructions here:
https://github.com/pauloricardomg/cassandra-list-subranges

Currently available for 1.2.16, but I guess that just changing the version
on the pom.xml and recompiling it will make it work on 2.0.x.

Cheers,

Paulo


On Thu, Jun 19, 2014 at 4:40 PM, Jack Krupansky j...@basetechnology.com
wrote:

 The DataStax doc should be current best practices:
 http://www.datastax.com/documentation/cassandra/2.0/
 cassandra/operations/ops_repair_nodes_c.html

 If you or anybody else finds it inadequate, speak up.

 -- Jack Krupansky

 -Original Message- From: Paolo Crosato
 Sent: Thursday, June 19, 2014 10:13 AM
 To: user@cassandra.apache.org
 Subject: Best practices for repair


 Hi eveybody,

 we have some problems running repairs on a timely schedule. We have a
 three node deployment, and we start repair on one node every week,
 repairing one column family at a time.
 However, when we run into the big column families, repair
 sessions usually hang indefinitely, and we have to restart them manually.

 The script runs commands like:

 nodetool repair keyspace columnfamily

 one by one.

 This has not been a major issue for some time, since we never delete
 data, however we would like to sort the issue once and for all.

 Reading resources on the net, I came to the conclusion that we could:

 1) either run a repair session like the one above, but with the -pr
 switch, and run it on every node, not just on one
 2) or run sub range repair as described here
 http://www.datastax.com/dev/blog/advanced-repair-techniques , which
 would be the best option.
 However the latter procedure would require us to write some java program
 that calls describe_splits to get the tokens to feed nodetool repair with.

 The second procedure is available out of the box only in the commercial
 version of the opscenter, is this true?

 I would like to know if these are the current best practices for repairs,
 or if there is some other option that makes repair easier to perform and
 more reliable than it is now.

 Regards,

 Paolo Crosato

 --
 Paolo Crosato
 Software engineer/Custom Solutions
 e-mail: paolo.cros...@targaubiest.com




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Cannot query secondary index

2014-06-10 Thread Paulo Ricardo Motta Gomes
Our approach for this scenario is to run a hadoop job that periodically
cleans old entries, but I admit it's far from ideal. Would be nice to have
a more native way to perform these kinds of tasks.

There's a legend about a compaction strategy that keeps only the first N
entries of a partition key, but I don't think it has been implemented yet;
if I remember correctly there's a JIRA ticket about it.


On Tue, Jun 10, 2014 at 3:39 PM, Redmumba redmu...@gmail.com wrote:

 Honestly, this has been by far my single biggest obstacle with Cassandra
 for time-based data--cleaning up the old data when the deletion criteria
 (i.e., date) isn't the primary key.  I've asked about a few different
 approaches, but I haven't really seen any feasible options that can be
 implemented easily.  I've seen the following:

1. Use date-based tables, then drop old tables, ala
audit_table_20140610, audit_table_20140609, etc..
But then I run into the issue of having to query every table--I would
have to execute queries against every day to get the data, and then merge
the data myself.  Unless, there's something in the binary driver I'm
missing, it doesn't sound like this would be practical.
2. Use a TTL
But then I have to basically decide on a value that works for
everything and, if it ever turns out I overestimated, I'm basically SOL,
because my cluster will be out of space.
3. Maintain a separate index of days to keys, and use this index as
the reference for which keys to delete.
But then this requires maintaining another index and a relatively
manual delete.

 I can't help but feel that I am just way over-engineering this, or that
 I'm missing something basic in my data model.  Except for the last
 approach, I can't help but feel that I'm overlooking something obvious.
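
 For option 1, the fan-out can at least be issued asynchronously and merged
 client-side. A rough sketch with the DataStax Java driver (2.x-era API;
 keyspace, table names and key values are made-up placeholders):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class FanOutRead {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            try {
                Session session = cluster.connect("audit_ks"); // placeholder keyspace
                List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
                for (String day : Arrays.asList("20140609", "20140610")) {
                    // one query per daily table, issued without blocking
                    futures.add(session.executeAsync(
                            "SELECT * FROM audit_table_" + day
                            + " WHERE id = 42 AND region = 'us'"));
                }
                for (ResultSetFuture f : futures) {
                    for (Row row : f.getUninterruptibly()) {
                        System.out.println(row); // merge/sort client-side as needed
                    }
                }
            } finally {
                cluster.close();
            }
        }
    }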

 Andrew


 Of course, Jonathan, I'll do my best!

It's an auditing table that, right now, uses a primary key consisting of a
combined partition key (the region and the object id), plus the
date and the process ID.  Each event in our system will create anywhere
 from 1-20 rows, for example, and multiple parts of the system might be
 working on the same object ID.  So the CF is constantly being appended
 to, but reads are rare.

 CREATE TABLE audit (
 id bigint,
 region ascii,
 date timestamp,
 pid int,
 PRIMARY KEY ((id, region), date, pid)
 );


 Data is queried on a specific object ID and region.  Optionally, users can
 restrict their query to a specific date range, which the above data model
 provides.
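
As a sketch of that access pattern (2.x-era DataStax Java driver; the
literal key and date values are placeholders):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class AuditQuery {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            try {
                Session session = cluster.connect("my_ks"); // placeholder keyspace
                // Partition is (id, region); the date range rides on the
                // first clustering column.
                ResultSet rs = session.execute(
                        "SELECT * FROM audit WHERE id = 42 AND region = 'us-east'"
                        + " AND date >= '2014-05-01' AND date < '2014-06-01'");
                for (Row row : rs) {
                    System.out.println(row);
                }
            } finally {
                cluster.close();
            }
        }
    }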

 However, we generate quite a bit of data, and we want a convenient way to
 get rid of the oldest data.  Since our system scales with the time of year,
 we might get 50GB a day during peak, and 5GB of data off peak.  We could
 pick the safest number--let's say, 30 days--and set the TTL using that.
The problem there is that we'll be using a very small
percentage of our available space 90% of the year.

 What I'd like to be able to do is drop old tables as needed--i.e., let's
 say when we hit 80% load across the cluster (or some such metric that takes
 the cluster-wide load into account), I want to drop the oldest day's
 records until we're under 80%.  That way, we're always using the maximum
 amount of space we can, without having to worry about getting to the point
 where we run out of space cluster-wide.

 My thoughts are--we could always make the date part of the primary key,
 but then we'd either a) have to query the entire range of dates, or b) we'd
 have to force a small date range when querying.  What are the penalties?
 Do you have any other suggestions?


 On Mon, Jun 9, 2014 at 5:15 PM, Jonathan Lacefield 
 jlacefi...@datastax.com wrote:

 Hello,

   Will you please describe the use case and what you are trying to model.
  What are some questions/queries that you would like to serve via
 Cassandra.  This will help the community help you a little better.

 Jonathan Lacefield
 Solutions Architect, DataStax
 (404) 822 3487
  http://www.linkedin.com/in/jlacefield

 http://www.datastax.com/cassandrasummit14



 On Mon, Jun 9, 2014 at 7:51 PM, Redmumba redmu...@gmail.com wrote:

 I've been trying to work around using date-based tables because I'd
 like to avoid the overhead.  It seems, however, that this is just not going
 to work.

 So here's a question--for these date-based tables (i.e., a table per
 day/week/month/whatever), how are they queried?  If I keep 60 days worth of
 auditing data, for example, I'd need to query all 60 tables--can I do that
 smoothly?  Or do I have to have 60 different select statements?  Is there a
 way for me to run the same query against all the tables?


 On Mon, Jun 9, 2014 at 3:42 PM, Redmumba redmu...@gmail.com wrote:

 Ah, so the secondary indices are really secondary against the primary
 key.  That makes sense.

 I'm beginning to see why the whole date-based table approach is the
 only one I've been able to 

Re: I have a deaf node?

2014-06-01 Thread Paulo Ricardo Motta Gomes
This post should definitely make to the hall of fame!! :)


On Mon, Jun 2, 2014 at 12:05 AM, Tim Dunphy bluethu...@gmail.com wrote:

 That made my day. Not to worry though unless you start seeing the number
 23 in your host ids.


 Yeah man, glad to provide some comic relief to the list! ;)


 On Sun, Jun 1, 2014 at 11:01 PM, Apostolis Xekoukoulotakis 
 xekou...@gmail.com wrote:

 That made my day. Not to worry though unless you start seeing the
 number 23 in your host ids.
  On Jun 2, 2014 12:40 AM, Kevin Burton bur...@spinn3r.com wrote:

 could be worse… it could be under caffeinated and say decafbad …


 On Sat, May 31, 2014 at 10:45 AM, Tim Dunphy bluethu...@gmail.com
 wrote:

 I think the deaf thing is just the ending of the host ID in
 hexadecimal. It's an extraordinary coincidence that it ends with DEAF :D


 Hah.. yeah that thought did cross my mind.  :)



 On Sat, May 31, 2014 at 1:35 PM, DuyHai Doan doanduy...@gmail.com
 wrote:

 I think the deaf thing is just the ending of the host ID in
 hexadecimal. It's an extraordinary coincidence that it ends with DEAF :D


 On Sat, May 31, 2014 at 6:38 PM, Tim Dunphy bluethu...@gmail.com
 wrote:

 I didn't realize cassandra nodes could develop hearing problems. :)


 But I have a dead node in my cluster I would like to get rid of.

 [root@beta:~] #nodetool status
 Datacenter: datacenter1
 ===
 Status=Up/Down
 |/ State=Normal/Leaving/Joining/Moving
 --  Address     Load      Tokens  Owns   Host ID                               Rack
 UN  10.10.1.94  199.6 KB  256     49.4%  fd2f76ae-8dcf-4e93-a37f-bf1e9088696e  rack1
 DN  10.10.1.64  ?         256     50.6%  f2a48fc7-a362-43f5-9061-4bb3739fdeaf  rack1

 I was just wondering what this could indicate and if that might mean
 that I will have some more trouble than I would be bargaining for in
 getting rid of it.

 I've made a couple of attempts to get rid of this so far. I'm about
 to try again.

 Thanks
 Tim

 --
 GPG me!!

 gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B





 --
 GPG me!!

 gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B




 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations
 are people.




 --
 GPG me!!

 gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: How does cassandra page through low cardinality indexes?

2014-05-29 Thread Paulo Ricardo Motta Gomes
Really informative thread, thank you!

We had a secondary index trauma a while ago, and since then we knew it was
not a good idea for most of the cases, but now it's even more clear why.


On Thu, May 29, 2014 at 5:31 PM, Robert Coli rc...@eventbrite.com wrote:

 On Thu, May 29, 2014 at 1:08 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Hello Robert

  There are some maths involved when considering the performance of
 secondary index in C*


 Yes, these are the maths which are behind my FIXMEs in the original post.
 I merely have not had time to explicitly describe them in the context of
 that draft post.

 Thank you for doing so! When I reference them in my eventual post, I will
 be sure to credit you.


  Because of its distributed nature, finding a *good* use-case for 2nd
 index is quite tricky, partly because it  depends on the query pattern but
 also on the cluster size and data distribution.


 Yep, and if you're doing this tricky thing, you probably want less opacity
 and more explicit understanding of what is happening under the hood and you
 want to be sure you won't run into a bug in the implementation, hence
 manual secondary index CFs.


   Apart from the performance aspect, secondary index column families use
 SizeTiered compaction so for a use case with a lot of updates you'll have
 plenty of tombstones... I'm not sure how the end user can switch to Leveled
 Compaction for 2nd index...


 Per Aleksey, secondary index column families actually use the compaction
 strategy of the column family they index. I agree that this seems weird,
 and is likely just another implementation detail you relinquish control of
 for the convenience of 2i.

 =Rob




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Number of rows under one partition key

2014-05-29 Thread Paulo Ricardo Motta Gomes
Hey,

We are considering upgrading from 1.2 to 2.0, why don't you consider 2.0
ready for production yet, Robert? Have you written about this somewhere
already?

A bit off-topic in this discussion but it would be interesting to know,
your posts are generally very enlightening.

Cheers,


On Thu, May 29, 2014 at 8:51 PM, Robert Coli rc...@eventbrite.com wrote:

 On Thu, May 15, 2014 at 6:10 AM, Vegard Berget p...@fantasista.no wrote:

 I know this has been discussed before, and I know there are limitations
 to how many rows one partition key in practice can handle.  But I am not
 sure if number of rows or total data is the deciding factor.


 Both. In terms of data size, partitions containing over a small number of
 hundreds of Megabytes begin to see diminishing returns in some cases.
 Partitions over 64 megabytes are compacted on disk, which should give you a
 rough sense of what Cassandra considers a large partition.


 Should we add another partition key to avoid 1 000 000 rows in the same
 thrift-row (which is how I understand it is actually stored)?  Or is 1 000
 000 rows okay?


 Depending on row size and access patterns, 1Mn rows is not extremely
 large. There are, however, some row sizes and operations where this order
 of magnitude of columns might be slow.


 Other considerations, for example compaction strategy and if we should do
 an upgrade to 2.0 because of this (we will upgrade anyway, but if it is
 recommended we will continue to use 2.0 in development and upgrade the
 production environment sooner)


 You should not upgrade to 2.0 in order to address this concern. You should
 upgrade to 2.0 when it is stable enough to run in production, which IMO is
 not yet. YMMV.


 I have done some testing, inserting a million rows and selecting them
 all, counting them and selecting individual rows (with both clientid and
 id) and it seems fine, but I want to ask to be sure that I am on the right
 track.


 If the access patterns you are using perform the way you would like with
 representative size data, sounds reasonable to me?

 If you are able to select all million rows within a reasonable percentage
 of the relevant timeout, I presume they cannot be too huge in terms of data
 size! :D

 =Rob




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Suggestions for upgrading cassandra

2014-05-27 Thread Paulo Ricardo Motta Gomes
I've written a bit about upgrading from 1.1 to 1.2, non-vnodes:
http://monkeys.chaordic.com.br/operation/zero-downtime-cassandra-upgrade/

Some tips may be valid for a more recent upgrade, but I'm sure the
community has more specific tips regarding the upgrade from 1.2 to 2.0.

On Tue, May 27, 2014 at 2:57 PM, Eric Plowe eric.pl...@gmail.com wrote:

 I have a cluster that is running 1.2.6. I'd like to upgrade that cluster
 to 2.0.7

 Any suggestions/tips that would make the upgrade process smooth?




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: cassandra boot is stuck in hint compaction.

2014-05-25 Thread Paulo Ricardo Motta Gomes
What is the Cassandra version? Are the same sstables being compacted over
and over?

Please post a sample of the compaction log and the output of DESCRIBE
TABLE system.hints; on cqlsh.

Cheers,


On Sun, May 25, 2014 at 6:12 AM, Igor Shprukh i...@newage.co.il wrote:


  Hi guys, we have a 6 node cluster, consisting of 5 Linux machines and a
  Windows one.

  After a hard shutdown of the Windows machine, the node is stuck on hints
  compaction for more than
  half an hour and Cassandra won't start. Must say that it is a strong
  machine with 16GB of RAM and 250 GB of space dedicated to the node. All
  other nodes are up.

  What could be the problem causing this?

  Thank you in advance.





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Disable FS journaling

2014-05-20 Thread Paulo Ricardo Motta Gomes
Hello,

Has anyone disabled file system journaling on Cassandra nodes? Does it make
any difference on write performance?

Cheers,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Disable FS journaling

2014-05-20 Thread Paulo Ricardo Motta Gomes
Thanks for the links!

Forgot to mention, using XFS here, as suggested by the Cassandra wiki. But
just double checked and it's apparently not possible to disable journaling
on XFS.

One of ours sysadmin just suggested disabling journaling, since it's mostly
for recovery purposes, and Cassandra already does that pretty well with
commitlog, replication and anti-entropy. It would anyway be nice to know if
there could be any performance benefits from it. But I personally don't
think it would help much, due to the append-only nature of cassandra writes.


On Tue, May 20, 2014 at 12:43 PM, Michael Shuler mich...@pbandjelly.org wrote:

 On 05/20/2014 09:54 AM, Samir Faci wrote:

 I'm not sure you'd be gaining much by doing this.  This is probably
 dependent on the file system you're referring to when you say
 journaling.  There's a few of them around,

 You could opt to use ext2 instead of ext3/4 in the unix world.  A quick
 google search linked me to this:


 ext2/3 is not a good choice for file size limitation and performance
 reasons.

 I started to search for a couple links, and a quick check of the links I
 posted a couple years ago seem to still be interesting  ;)

 http://mail-archives.apache.org/mod_mbox/cassandra-user/
 201204.mbox/%3c4f7c5c16.1020...@pbandjelly.org%3E

 (repost from above)

 Hopefully this is some good reading on the topic:

 https://www.google.com/search?q=xfs+site%3Ahttp%3A%2F%
 2Fmail-archives.apache.org%2Fmod_mbox%2Fcassandra-user

 one of the more interesting considerations:
 http://mail-archives.apache.org/mod_mbox/cassandra-user/201004.mbox/%
 3ch2y96b607d1004131614k5382b3a5ie899989d62921...@mail.gmail.com%3E

 http://wiki.apache.org/cassandra/CassandraHardware

 http://wiki.apache.org/cassandra/LargeDataSetConsiderations

 http://www.datastax.com/dev/blog/questions-from-the-tokyo-
 cassandra-conference

 --
 Kind regards,
 Michael




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Disable FS journaling

2014-05-20 Thread Paulo Ricardo Motta Gomes
On Tue, May 20, 2014 at 1:24 PM, Terje Marthinussen tmarthinus...@gmail.com
 wrote:

 Journal enabled is faster on almost all operations.


Good to know, thanks!



 Recovery here is more about saving you from waiting 1/2 hour from a
 traditional full file system check.


On an EC2 environment you normally lose the machine anyway on failures, so
that's not of much use in that case.


 Feel free to wait if you want though! :)

 Regards,
 Terje

 On 21 May 2014, at 01:11, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 Thanks for the links!

 Forgot to mention, using XFS here, as suggested by the Cassandra wiki. But
 just double checked and it's apparently not possible to disable journaling
 on XFS.

 One of ours sysadmin just suggested disabling journaling, since it's
 mostly for recovery purposes, and Cassandra already does that pretty well
 with commitlog, replication and anti-entropy. It would anyway be nice to
 know if there could be any performance benefits from it. But I personally
 don't think it would help much, due to the append-only nature of cassandra
 writes.


 On Tue, May 20, 2014 at 12:43 PM, Michael Shuler 
 mich...@pbandjelly.org wrote:

 On 05/20/2014 09:54 AM, Samir Faci wrote:

 I'm not sure you'd be gaining much by doing this.  This is probably
 dependent on the file system you're referring to when you say
 journaling.  There's a few of them around,

 You could opt to use ext2 instead of ext3/4 in the unix world.  A quick
 google search linked me to this:


 ext2/3 is not a good choice for file size limitation and performance
 reasons.

 I started to search for a couple links, and a quick check of the links I
 posted a couple years ago seem to still be interesting  ;)

 http://mail-archives.apache.org/mod_mbox/cassandra-user/
 201204.mbox/%3c4f7c5c16.1020...@pbandjelly.org%3E

 (repost from above)

 Hopefully this is some good reading on the topic:

 https://www.google.com/search?q=xfs+site%3Ahttp%3A%2F%
 2Fmail-archives.apache.org%2Fmod_mbox%2Fcassandra-user

 one of the more interesting considerations:
 http://mail-archives.apache.org/mod_mbox/cassandra-user/201004.mbox/%
 3ch2y96b607d1004131614k5382b3a5ie899989d62921...@mail.gmail.com%3E

 http://wiki.apache.org/cassandra/CassandraHardware

 http://wiki.apache.org/cassandra/LargeDataSetConsiderations

 http://www.datastax.com/dev/blog/questions-from-the-tokyo-
 cassandra-conference

 --
 Kind regards,
 Michael




 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Disable reads during node rebuild

2014-05-16 Thread Paulo Ricardo Motta Gomes
That'll be really useful, thanks!!


On Wed, May 14, 2014 at 7:47 PM, Aaron Morton aa...@thelastpickle.com wrote:

 As of 2.0.7, driftx has added this long-requested feature.

 Thanks

 A
 -
 Aaron Morton
 New Zealand
 @aaronmorton

 Co-Founder  Principal Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com

 On 13/05/2014, at 9:36 am, Robert Coli rc...@eventbrite.com wrote:

 On Mon, May 12, 2014 at 10:18 AM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 Is there a way to disable reads from a node while performing rebuild from
 another datacenter? I tried starting the node in write survey mode, but
 the nodetool rebuild command does not work in this mode.


 As of 2.0.7, driftx has added this long-requested feature.

 https://issues.apache.org/jira/browse/CASSANDRA-6961

 Note that it is impossible to completely close the race window here as
 long as writes are incoming, this functionality just dramatically shortens
 it.

 =Rob






-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Mutation messages dropped

2014-05-16 Thread Paulo Ricardo Motta Gomes
It means asynchronous write mutations were dropped, but if the writes are
completing without TimedOutException, then at least ConsistencyLevel
replicas were correctly written. The remaining replicas will eventually be
fixed by hinted handoff, anti-entropy (repair) or read repair.

More info: http://wiki.apache.org/cassandra/FAQ#dropped_messages

Please note that 1 mutation != 1 record. For instance, if 1 row has N
columns, then a record write for that row will have N mutations AFAIK
(please correct me if I'm wrong).

On Fri, May 9, 2014 at 8:52 AM, Raveendran, Varsha IN BLR STS 
varsha.raveend...@siemens.com wrote:

  Hello,

  I am writing around 10 million records continuously into a single node
  Cassandra (2.0.5).
  In the Cassandra log file I see an entry “272 MUTATION messages dropped
  in last 5000ms”. Does this mean that 272 records were not written
  successfully?

 Thanks,
 Varsha





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)

2014-05-16 Thread Paulo Ricardo Motta Gomes
Hello Anton,

What version of Cassandra are you using? If it's between 1.2.6 and 2.0.6,
setInputRange(startToken, endToken) is not working.

This was fixed in 2.0.7:
https://issues.apache.org/jira/browse/CASSANDRA-6436

If you can't upgrade you can copy AbstractCFIF and CFIF to your project and
apply the patch there.
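
Once the fix is in, the wiring looks roughly like this (a sketch; the
address, keyspace/CF names and tokens are placeholders, with the end token
chosen to cover about 5% of the Murmur3 ring):

    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.hadoop.conf.Configuration;

    public class RestrictedScan {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");
            ConfigHelper.setInputRpcPort(conf, "9160");
            ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");
            ConfigHelper.setInputColumnFamily(conf, "my_ks", "my_cf");
            // Murmur3 tokens span [-2^63, 2^63); bounding the scan to the
            // first ~5% of that range:
            ConfigHelper.setInputRange(conf,
                    "-9223372036854775808",
                    "-8301034833169298228");
        }
    }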

Cheers,

Paulo


On Wed, May 14, 2014 at 10:29 PM, Anton Brazhnyk anton.brazh...@genesys.com
 wrote:

 Greetings,

 I'm reading data from C* with Spark (via ColumnFamilyInputFormat) and I'd
 like to read just part of it - something like Spark's sample() function.
 Cassandra's API seems allow to do it with its
 ConfigHelper.setInputRange(jobConfiguration, startToken, endToken) method,
 but it doesn't work.
 The limit is just ignored and the entire column family is scanned. It
 seems this kind of feature is just not supported
 and sources of AbstractColumnFamilyInputFormat.getSplits confirm that
 (IMO).
 Questions:
 1. Am I right that there is no way to get some data limited by token range
 with ColumnFamilyInputFormat?
 2. Is there other way to limit the amount of data read from Cassandra with
 Spark and ColumnFamilyInputFormat,
 so that this amount is predictable (like 5% of entire dataset)?


 WBR,
 Anton





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Efficient bulk range deletions without compactions by dropping SSTables.

2014-05-16 Thread Paulo Ricardo Motta Gomes
Hello Kevin,

In 2.0.X an SSTable is automatically dropped if it contains only
tombstones: https://issues.apache.org/jira/browse/CASSANDRA-5228. However
this will most likely happen if you use LCS. STCS will create sstables of
larger size that will probably have mixed expired and unexpired data.  This
could be solved by the single-sstable tombstone compaction that
unfortunately is not working well (
https://issues.apache.org/jira/browse/CASSANDRA-6563).

I don't know of a way to manually drop specific sstables safely, you could
try implementing a script that compares sstable timestamps to check if an
sstable is safely droppable as done in CASSANDRA-5228. There are proposals
to create a compaction strategy optimized for log only data that only
deletes old sstables but it's not ready yet AFAIK.
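
The core of such a safety check is just a timestamp comparison, sketched
below (a toy, not a production tool: the min/max timestamps would in
practice come from something like sstablemetadata, and this ignores
TTL/expiry and compaction races entirely):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class DroppableCheck {
        static class Meta {
            final String name; final long minTs, maxTs;
            Meta(String name, long minTs, long maxTs) {
                this.name = name; this.minTs = minTs; this.maxTs = maxTs;
            }
        }

        // An sstable whose newest cell is older than the oldest cell of every
        // other sstable cannot be shadowing live data anywhere else.
        static List<String> safelyDroppable(List<Meta> sstables) {
            List<String> out = new ArrayList<String>();
            for (Meta m : sstables) {
                boolean safe = true;
                for (Meta other : sstables) {
                    if (other != m && m.maxTs >= other.minTs) { safe = false; break; }
                }
                if (safe) out.add(m.name);
            }
            return out;
        }

        public static void main(String[] args) {
            List<Meta> tables = Arrays.asList(
                    new Meta("old-Data.db", 100L, 200L), // entirely older than the rest
                    new Meta("mid-Data.db", 300L, 400L),
                    new Meta("new-Data.db", 350L, 500L));
            System.out.println(safelyDroppable(tables)); // prints [old-Data.db]
        }
    }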

Cheers,

Paulo

On Mon, May 12, 2014 at 8:53 PM, Kevin Burton bur...@spinn3r.com wrote:

 We have a log only data structure… everything is appended and nothing is
 ever updated.

 We should be totally fine with having lots of SSTables sitting on disk
 because even if we did a major compaction the data would still look the
 same.

 By 'lots' I mean maybe 1000 max.  Maybe 1GB each.

 However, I would like a way to delete older data.

 One way to solve this could be to just drop an entire SSTable if all the
 records inside have tombstones.

 Is this possible, to just drop a specific SSTable?

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ 
 profile https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are
 people.




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Automatic tombstone removal issue (STCS)

2014-05-15 Thread Paulo Ricardo Motta Gomes
I just updated CASSANDRA-6563 with more details and proposed a patch to
solve the issue, in case anyone else is interested.

https://issues.apache.org/jira/browse/CASSANDRA-6563

On Tue, May 6, 2014 at 10:00 PM, Paulo Ricardo Motta Gomes 
paulo.mo...@chaordicsystems.com wrote:

 Robert: thanks for the support, you are right, this belonged more to the
 dev list but I didn't think of it.

 Yuki: thanks a lot for the clarification, this is what I suspected.

I understand it's costly to check row by row overlap in order to decide if
a SSTable is a candidate for compaction, but doesn't the compaction process
already perform this check when removing tombstones? So, couldn't this
check be dropped during decision time and let the compaction run anyway?

This optimization is especially interesting with large STCS sstables, where
the token range will very likely overlap with all other sstables, so it's a
pity it's almost never being triggered in these cases.

 On Tue, May 6, 2014 at 9:32 PM, Yuki Morishita mor.y...@gmail.com wrote:

 Hi Paulo,

 The reason we check overlap is not to resurrect deleted data by only
 dropping tombstone marker from single SSTable.
 And we don't want to check row by row to determine if SSTable is
 droppable since it takes time, so we use token ranges to determine if
 it MAY have droppable columns.

 On Tue, May 6, 2014 at 7:14 PM, Paulo Ricardo Motta Gomes
 paulo.mo...@chaordicsystems.com wrote:
  Hello,
 
  Sorry for being persistent, but I'd love to clear my understanding on
 this.
  Has anyone seen single sstable compaction being triggered for STCS
 sstables
  with high tombstone ratio?
 
  Because if the above understanding is correct, the current
 implementation
  almost never triggers this kind of compaction, since the token ranges
 of a
  node's sstable almost always overlap. Could this be a bug or is it
 expected
  behavior?
 
  Thank you,
 
 
 
  On Mon, May 5, 2014 at 8:59 AM, Paulo Ricardo Motta Gomes
  paulo.mo...@chaordicsystems.com wrote:
 
  Hello,
 
  After noticing that automatic tombstone removal (CASSANDRA-3442) was
 not
  working in an append-only STCS CF with 40% of droppable tombstone
 ratio I
  investigated why the compaction was not being triggered in the largest
  SSTable with 16GB and about 70% droppable tombstone ratio.
 
  When the code goes to check if the SSTable is candidate to be compacted
  (AbstractCompactionStrategy.worthDroppingTombstones), it verifies if
 all the
  others SSTables overlap with the current SSTable by checking if the
 start
  and end tokens overlap. The problem is that all SSTables contain
 pretty much
  the whole node token range, so all of them overlap nearly all the
 time, so
  the automatic tombstone removal never happens. Is there any case in
 STCS
  where all sstables token ranges DO NOT overlap?
 
  I understand during the tombstone removal process it's necessary to
 verify
  if the compacted row exists in any other SSTable, but I don't
 understand why
  it's necessary to verify if the token ranges overlap to decide if a
  tombstone compaction must be executed on a single SSTable with high
  droppable tombstone ratio.
 
  Any clarification would be kindly appreciated.
 
  PS: Cassandra version: 1.2.16
 
  --
  Paulo Motta
 
  Chaordic | Platform
  www.chaordic.com.br
  +55 48 3232.3200
 
 
 
 
  --
  Paulo Motta
 
  Chaordic | Platform
  www.chaordic.com.br
  +55 48 3232.3200



 --
 Yuki Morishita
  t:yukim (http://twitter.com/yukim)




 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Bootstrap failure on C* 1.2.13

2014-05-14 Thread Paulo Ricardo Motta Gomes
Hello,

After about 3 months I was able to solve this issue, which happened again
after another node died.

The problem is the datastax 1.2 node replacement docs [1] said that This
procedure applies to clusters using vnodes. If not using vnodes, use the
instructions in the Cassandra 1.1 documentation.

However, the 1.1 docs did not mention the property
-Dcassandra.replace_address=address_of_dead_node, which was only
introduced in 1.2. So, what happens without this flag is that the
replacement node tries to stream data from the dead node, failing the
bootstrap process. Adding this flag solves the problem.

Big thanks to driftx from #cassandra who helped troubleshoot the issue. The
docs were already updated to mention the property even for non-vnodes
cluster.

[1]
http://www.datastax.com/documentation/cassandra/1.2/cassandra/operations/ops_replace_node_t.html

Cheers,

On Sat, Feb 15, 2014 at 3:31 PM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 Hi Rob,

 I don't understand how setting those initial_token might solve this
 issue. Even more since we cannot set them before bootstrapping...

 Plus, once those tokens are set, we would have to modify them after any new
 bootstrap / decommission. Which would also imply to run a rolling restart
 for the new configuration (cassandra.yaml)  to be taken into account. This
 is quite a heavy process to perform a NOOP...

 What did I miss ?

 Thanks for getting involved and trying to help anyway :).

 Alain


 2014-02-15 1:13 GMT+01:00 Robert Coli rc...@eventbrite.com:

 On Fri, Feb 14, 2014 at 10:08 AM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 But in our case, our cluster was not using VNodes, so this workaround
 will probably not work with VNodes, since you cannot specify the 256 tokens
 from the old node.


 Sure you can, in a comma delimited list. I plan to write a short blog
 post about this, but...

 I recommend that anyone using Cassandra, vnodes or not, always explicitly
 populate their initial_token line in cassandra.yaml. There are a number of
 cases where you will lose if you do not do so, and AFAICT no cases where
 you lose by doing so.

 If one is using vnodes and wants to do this, the process goes like :

 1) set num_tokens to the desired number of vnodes
 2) start node/bootstrap
 3) use a one liner like jeffj's :
 
 nodetool info -T | grep ^Token | awk '{ print $3 }' | tr \\n , | sed -e
 's/,$/\n/'
 
 to get a comma delimited list of the vnode tokens
 4) insert this comma delimited list in initial_token, and comment out
 num_tokens (though it is a NOOP)

 =Rob





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Cassandra hadoop job fails if any node is DOWN

2014-05-14 Thread Paulo Ricardo Motta Gomes
Hello,

One of the nodes of our Analytics DC is dead, but ColumnFamilyInputFormat
(CFIF) still assigns Hadoop input splits to it. This leads to many failed
tasks and consequently a failed job.

* Tasks fail with: java.lang.RuntimeException:
org.apache.thrift.transport.TTransportException: Failed to open a transport
to XX.75:9160. (obviously, the node is dead)

* Job fails with: Job Failed: # of failed Map Tasks exceeded allowed limit.
FailedCount: 1. LastFailedTask: task_201404180250_4207_m_79

We use RF=2 and CL=LOCAL_ONE for hadoop jobs, C* 1.2.16. Is this expected
behavior?

I checked CFIF code, but it always assigns input splits to all the ring
nodes, no matter if the node is dead or alive. What we do to fix it is patch
CFIF to blacklist the dead node, but this is not a very automatic procedure.
Am I missing something here?

Cheers,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Disable reads during node rebuild

2014-05-14 Thread Paulo Ricardo Motta Gomes
That's a nice workaround, will be really helpful in emergency situations
like this.

Thanks,


On Mon, May 12, 2014 at 6:58 PM, Aaron Morton aa...@thelastpickle.com wrote:

 I'm not able to replace a dead node using the ordinary procedure
 (bootstrap+join), and would like to rebuild the replacement node from
 another DC.

 Normally when you want to add a new DC to the cluster the command to use
 is nodetool rebuild $DC_NAME .(with auto_bootstrap: false) That will get
 the node to stream data from the $DC_NAME

 The problem is that if I start a node with auto_bootstrap=false to perform
 the rebuild, it automatically starts serving empty reads (CL=LOCAL_ONE).

 When adding a new DC the nodes wont be processing reads, that is not the
 case for you.

 You should disable the client APIs to prevent the clients from calling
 the new nodes, use -Dcassandra.start_rpc=false and
 -Dcassandra.start_native_transport=false in cassandra-env.sh or appropriate
 settings in cassandra.yaml

 Disabling reads from other nodes will be harder. IIRC during bootstrap a
 different timeout (based on ring_delay) is used to detect if the
 bootstrapping node is down. However if the node is running and you use
 nodetool rebuild i’m pretty sure the normal gossip failure detectors will
 kick in. Which means you cannot disable gossip to prevent reads. Also we
 would want the node to be up for writes.

 But what you can do is artificially set the severity of the node high so
 the dynamic snitch will route around it. See
 https://github.com/apache/cassandra/blob/cassandra-2.0/src/java/org/apache/cassandra/locator/DynamicEndpointSnitchMBean.java#L37


 * Set the value to something high on the node you will be rebuilding, the
 number or cores on the system should do.  (jmxterm is handy for this
 http://wiki.cyclopsgroup.org/jmxterm)
 * Check nodetool gossipinfo on the other nodes to see the SEVERITY app
 state has propagated.
 * Watch completed ReadStage tasks on the node you want to rebuild. If you
 have read repair enabled it will still get some traffic.
 * Do rebuild
 * Reset severity to 0
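
 For reference, a minimal JMX sketch of the set-severity step (host and
 value are placeholders; the MBean name mirrors the
 DynamicEndpointSnitchMBean source linked above, so verify it with jconsole
 or jmxterm first):

    import javax.management.Attribute;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class SetSeverity {
        public static void main(String[] args) throws Exception {
            // Placeholder host; 7199 is Cassandra's default JMX port.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://10.0.0.1:7199/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                ObjectName snitch = new ObjectName(
                        "org.apache.cassandra.db:type=DynamicEndpointSnitch");
                // e.g. the number of cores, as suggested above
                mbs.setAttribute(snitch, new Attribute("Severity", 8.0));
                System.out.println("Severity now: "
                        + mbs.getAttribute(snitch, "Severity"));
            } finally {
                jmxc.close();
            }
        }
    }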

 Hope that helps.
 Aaron

 -
 Aaron Morton
 New Zealand
 @aaronmorton

 Co-Founder  Principal Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com

 On 13/05/2014, at 5:18 am, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 Hello,

 I'm not able to replace a dead node using the ordinary procedure
 (bootstrap+join), and would like to rebuild the replacement node from
 another DC. The problem is that if I start a node with auto_bootstrap=false
 to perform the rebuild, it automatically starts serving empty reads
 (CL=LOCAL_ONE).

 Is there a way to disable reads from a node while performing rebuild from
 another datacenter? I tried starting the node in write survey mode, but
 the nodetool rebuild command does not work in this mode.

 Thanks,

 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Automatic tombstone removal issue (STCS)

2014-05-07 Thread Paulo Ricardo Motta Gomes
Robert: thanks for the support, you are right, this belonged more to the
dev list but I didn't think of it.

Yuki: thanks a lot for the clarification, this is what I suspected.

I understand it's costly to check row by row overlap in order to decide if
a SSTable is a candidate for compaction, but doesn't the compaction process
already perform this check when removing tombstones? So, couldn't this
check be dropped during decision time and let the compaction run anyway?

This optimization is especially interesting with large STCS sstables, where
the token range will very likely overlap with all other sstables, so it's a
pity it's almost never being triggered in these cases.

On Tue, May 6, 2014 at 9:32 PM, Yuki Morishita mor.y...@gmail.com wrote:

 Hi Paulo,

 The reason we check overlap is not to resurrect deleted data by only
 dropping tombstone marker from single SSTable.
 And we don't want to check row by row to determine if SSTable is
 droppable since it takes time, so we use token ranges to determine if
 it MAY have droppable columns.

 On Tue, May 6, 2014 at 7:14 PM, Paulo Ricardo Motta Gomes
 paulo.mo...@chaordicsystems.com wrote:
  Hello,
 
  Sorry for being persistent, but I'd love to clear my understanding on
 this.
  Has anyone seen single sstable compaction being triggered for STCS
 sstables
  with high tombstone ratio?
 
  Because if the above understanding is correct, the current implementation
  almost never triggers this kind of compaction, since the token ranges of
 a
  node's sstable almost always overlap. Could this be a bug or is it
 expected
  behavior?
 
  Thank you,
 
 
 
  On Mon, May 5, 2014 at 8:59 AM, Paulo Ricardo Motta Gomes
  paulo.mo...@chaordicsystems.com wrote:
 
  Hello,
 
  After noticing that automatic tombstone removal (CASSANDRA-3442) was not
  working in an append-only STCS CF with 40% of droppable tombstone ratio
 I
  investigated why the compaction was not being triggered in the largest
  SSTable with 16GB and about 70% droppable tombstone ratio.
 
  When the code goes to check if the SSTable is candidate to be compacted
  (AbstractCompactionStrategy.worthDroppingTombstones), it verifies if
 all the
  others SSTables overlap with the current SSTable by checking if the
 start
  and end tokens overlap. The problem is that all SSTables contain pretty
 much
  the whole node token range, so all of them overlap nearly all the time,
 so
  the automatic tombstone removal never happens. Is there any case in STCS
  where all sstables token ranges DO NOT overlap?
 
  I understand during the tombstone removal process it's necessary to
 verify
  if the compacted row exists in any other SSTable, but I don't
 understand why
  it's necessary to verify if the token ranges overlap to decide if a
  tombstone compaction must be executed on a single SSTable with high
  droppable tombstone ratio.
 
  Any clarification would be kindly appreciated.
 
  PS: Cassandra version: 1.2.16
 
  --
  Paulo Motta
 
  Chaordic | Platform
  www.chaordic.com.br
  +55 48 3232.3200
 
 
 
 
  --
  Paulo Motta
 
  Chaordic | Platform
  www.chaordic.com.br
  +55 48 3232.3200



 --
 Yuki Morishita
  t:yukim (http://twitter.com/yukim)




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: Automatic tombstone removal issue (STCS)

2014-05-06 Thread Paulo Ricardo Motta Gomes
Hello,

Sorry for being persistent, but I'd love to clear my understanding on this.
Has anyone seen single sstable compaction being triggered for STCS sstables
with high tombstone ratio?

Because if the above understanding is correct, the current implementation
almost never triggers this kind of compaction, since the token ranges of a
node's sstable almost always overlap. Could this be a bug or is it expected
behavior?

Thank you,



On Mon, May 5, 2014 at 8:59 AM, Paulo Ricardo Motta Gomes 
paulo.mo...@chaordicsystems.com wrote:

 Hello,

 After noticing that automatic tombstone removal (CASSANDRA-3442) was not
 working in an append-only STCS CF with 40% of droppable tombstone ratio I
 investigated why the compaction was not being triggered in the largest
 SSTable with 16GB and about 70% droppable tombstone ratio.

 When the code goes to check if the SSTable is candidate to be compacted
 (AbstractCompactionStrategy.worthDroppingTombstones), it verifies if all
 the others SSTables overlap with the current SSTable by checking if the
 start and end tokens overlap. The problem is that all SSTables contain
 pretty much the whole node token range, so all of them overlap nearly all
 the time, so the automatic tombstone removal never happens. Is there any
 case in STCS where all sstables token ranges DO NOT overlap?

 I understand during the tombstone removal process it's necessary to verify
 if the compacted row exists in any other SSTable, but I don't understand
 why it's necessary to verify if the token ranges overlap to decide if a
 tombstone compaction must be executed on a single SSTable with high
 droppable tombstone ratio.

 Any clarification would be kindly appreciated.

 PS: Cassandra version: 1.2.16

 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Automatic tombstone removal issue (STCS)

2014-05-05 Thread Paulo Ricardo Motta Gomes
Hello,

After noticing that automatic tombstone removal (CASSANDRA-3442) was not
working in an append-only STCS CF with 40% of droppable tombstone ratio I
investigated why the compaction was not being triggered in the largest
SSTable with 16GB and about 70% droppable tombstone ratio.

When the code goes to check if the SSTable is candidate to be compacted
(AbstractCompactionStrategy.worthDroppingTombstones), it verifies if all
the others SSTables overlap with the current SSTable by checking if the
start and end tokens overlap. The problem is that all SSTables contain
pretty much the whole node token range, so all of them overlap nearly all
the time, so the automatic tombstone removal never happens. Is there any
case in STCS where all sstables token ranges DO NOT overlap?

I understand during the tombstone removal process it's necessary to verify
if the compacted row exists in any other SSTable, but I don't understand
why it's necessary to verify if the token ranges overlap to decide if a
tombstone compaction must be executed on a single SSTable with high
droppable tombstone ratio.

Any clarification would be kindly appreciated.

PS: Cassandra version: 1.2.16
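
To make the failure mode concrete, here is a toy version of the pre-check
(made-up ranges; the real code compares each sstable's first/last tokens):

    public class OverlapCheck {
        // Closed [left, right] ranges, simplified to plain longs.
        static boolean overlaps(long aLeft, long aRight, long bLeft, long bRight) {
            return aLeft <= bRight && bLeft <= aRight;
        }

        public static void main(String[] args) {
            // Three STCS sstables, each spanning almost the whole node range:
            long[][] sstables = { {-100, 95}, {-98, 99}, {-99, 97} };
            long[] candidate = sstables[0];
            boolean anyOverlap = false;
            for (int i = 1; i < sstables.length; i++) {
                anyOverlap |= overlaps(candidate[0], candidate[1],
                        sstables[i][0], sstables[i][1]);
            }
            // Prints false: the single-sstable tombstone compaction is
            // skipped, no matter how high the droppable tombstone ratio is.
            System.out.println("worth compacting alone: " + !anyOverlap);
        }
    }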

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Is a hint stored when a mutation is dropped?

2014-04-24 Thread Paulo Ricardo Motta Gomes
The official docs say that dropped mutations are only fixed by Read Repair
and Anti-entropy (http://wiki.apache.org/cassandra/FAQ#dropped_messages).
However, in this thread (
http://grokbase.com/t/cassandra/user/1235ctdbca/mutation-dropped-messages)
Aaron Morton says that Hinted Handoff also repairs dropped mutations, but I
couldn't find more info on that. Is this still the behavior on 1.2+?

To illustrate:

If I write with RF=2, CL=ONE: one mutation is accepted, the write returns
and the other mutation is dropped. Does the coordinator store a hint of the
dropped replica? Even without running repair, will I be able to read that
write from the dropped replica in 30 minutes?

Cheers,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: clearing tombstones?

2014-04-11 Thread Paulo Ricardo Motta Gomes
I have a similar problem here, I deleted about 30% of a very large CF using
LCS (about 80GB per node), but still my data hasn't shrunk, even though I
used 1 day for gc_grace_seconds. Would nodetool scrub help? Does nodetool
scrub force a minor compaction?

Cheers,

Paulo


On Fri, Apr 11, 2014 at 12:12 PM, Mark Reddy mark.re...@boxever.com wrote:

 Yes, running nodetool compact (major compaction) creates one large
 SSTable. This will mess up the heuristics of the SizeTiered strategy (is
 this the compaction strategy you are using?) leading to multiple 'small'
 SSTables alongside the single large SSTable, which results in increased
 read latency. You will incur the operational overhead of having to manage
 compactions if you wish to compact these smaller SSTables. For all these
 reasons it is generally advised to stay away from running compactions
 manually.

 Assuming that this is a production environment and you want to keep
 everything running as smoothly as possible I would reduce the gc_grace on
 the CF, allow automatic minor compactions to kick in and then increase the
 gc_grace once again after the tombstones have been removed.
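
 Schema-wise the gc_grace dance is just two ALTERs around the compactions,
 e.g. (a sketch with the 2.x-era DataStax Java driver; keyspace/table names
 are placeholders and 864000 is the 10-day default):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class GcGraceDance {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            try {
                Session session = cluster.connect();
                session.execute("ALTER TABLE my_ks.my_cf WITH gc_grace_seconds = 0");
                // ...allow automatic minor compactions to kick in here and
                // remove the tombstones before restoring the default...
                session.execute("ALTER TABLE my_ks.my_cf WITH gc_grace_seconds = 864000");
            } finally {
                cluster.close();
            }
        }
    }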


 On Fri, Apr 11, 2014 at 3:44 PM, William Oberman ober...@civicscience.com
  wrote:

 So, if I was impatient and just wanted to make this happen now, I could:

 1.) Change GCGraceSeconds of the CF to 0
 2.) run nodetool compact (*)
 3.) Change GCGraceSeconds of the CF back to 10 days

 Since I have ~900M tombstones, even if I miss a few due to impatience, I
 don't care *that* much as I could re-run my clean up tool against the now
 much smaller CF.

 (*) A long long time ago I seem to recall reading advice to never
 run nodetool compact, but I can't remember why.  Is there any bad
 long term consequence?  Short term there are several:
 -a heavy operation
 -temporary 2x disk space
 -one big SSTable afterwards
 But moving forward, everything is ok right?
  CommitLog/MemTable-SStables, minor compactions that merge SSTables,
 etc...  The only flaw I can think of is it will take forever until the
 SSTable minor compactions build up enough to consider including the big
 SSTable in a compaction, making it likely I'll have to self manage
 compactions.



 On Fri, Apr 11, 2014 at 10:31 AM, Mark Reddy mark.re...@boxever.com wrote:

 Correct, a tombstone will only be removed after gc_grace period has
 elapsed. The default value is set to 10 days which allows a great deal of
 time for consistency to be achieved prior to deletion. If you are
 operationally confident that you can achieve consistency via anti-entropy
 repairs within a shorter period you can always reduce that 10 day interval.


 Mark


 On Fri, Apr 11, 2014 at 3:16 PM, William Oberman 
 ober...@civicscience.com wrote:

 I'm seeing a lot of articles about a dependency between removing
 tombstones and GCGraceSeconds, which might be my problem (I just checked,
 and this CF has GCGraceSeconds of 10 days).


 On Fri, Apr 11, 2014 at 10:10 AM, tommaso barbugli tbarbu...@gmail.com
  wrote:

 compaction should take care of it; for me it never worked so I run
 nodetool compact on every node; that does it.


 2014-04-11 16:05 GMT+02:00 William Oberman ober...@civicscience.com:

 I'm wondering what will clear tombstoned rows?  nodetool cleanup,
 nodetool repair, or time (as in just wait)?

 I had a CF that was more or less storing session information.  After
 some time, we decided that one piece of this information was pointless to
 track (and was 90%+ of the columns, and in 99% of those cases was ALL
 columns for a row).   I wrote a process to remove all of those columns
 (which again in a vast majority of cases had the effect of removing the
 whole row).

 This CF had ~1 billion rows, so I expect to be left with ~100m rows.
 After I did this mass delete, everything was the same size on disk (which
 I expected, knowing how tombstoning works).  It wasn't 100% clear to me
 what to poke to cause compactions to clear the tombstones.  First I tried
 nodetool cleanup on a candidate node.  But, afterwards the disk usage was
 the same.  Then I tried nodetool repair on that same node.  But again,
 disk usage is still the same.  The CF has no snapshots.

 So, am I misunderstanding something?  Is there another operation to
 try?  Do I have to just wait?  I've only done cleanup/repair on one node.
 Do I have to run one or the other over all nodes to clear tombstones?

 Cassandra 1.2.15 if it matters,

 Thanks!

 will











-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: clearing tombstones?

2014-04-11 Thread Paulo Ricardo Motta Gomes
This thread is really informative, thanks for the good feedback.

My question is: is there a way to force tombstones to be cleared with LCS?
Does scrub help in any case? Or is the only solution to create a new CF and
migrate all the data if you intend to do a large CF cleanup?

Cheers,


On Fri, Apr 11, 2014 at 2:02 PM, Mark Reddy mark.re...@boxever.com wrote:

 Thats great Will, if you could update the thread with the actions you
 decide to take and the results that would be great.


 Mark


 On Fri, Apr 11, 2014 at 5:53 PM, William Oberman ober...@civicscience.com
  wrote:

 I've learned a *lot* from this thread.  My thanks to all of the
 contributors!

 Paulo: Good luck with LCS.  I wish I could help there, but all of my CF's
 are SizeTiered (mostly as I'm on the same schema/same settings since 0.7...)

 will



 On Fri, Apr 11, 2014 at 12:14 PM, Mina Naguib mina.nag...@adgear.com wrote:


 Levelled Compaction is a wholly different beast when it comes to
 tombstones.

 The tombstones are inserted, like any other write really, at the lower
 levels in the leveldb hierarchy.

 They are only removed after they have had the chance to naturally
 migrate upwards in the leveldb hierarchy to the highest level in your data
 store.  How long that takes depends on:
  1. The amount of data in your store and the number of levels your LCS
 strategy has
 2. The amount of new writes entering the bottom funnel of your leveldb,
 forcing upwards compaction and combining

 To give you an idea, I had a similar scenario and ran a (slow,
 throttled) delete job on my cluster around December-January.  Here's a
 graph of the disk space usage on one node.  Notice the still-declining
 usage long after the cleanup job has finished (sometime in January).  I
 tend to think of tombstones in LCS as little bombs that get to explode much
 later in time:

 http://mina.naguib.ca/images/tombstones-cassandra-LCS.jpg



 On 2014-04-11, at 11:20 AM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 I have a similar problem here, I deleted about 30% of a very large CF
 using LCS (about 80GB per node), but my data still hasn't shrunk, even
 though I set gc_grace_seconds to 1 day. Would nodetool scrub help? Does
 nodetool scrub force a minor compaction?

 Cheers,

 Paulo


 On Fri, Apr 11, 2014 at 12:12 PM, Mark Reddy mark.re...@boxever.com wrote:

 Yes, running nodetool compact (major compaction) creates one large
 SSTable. This will mess up the heuristics of the SizeTiered strategy (is
 this the compaction strategy you are using?) leading to multiple 'small'
 SSTables alongside the single large SSTable, which results in increased
 read latency. You will incur the operational overhead of having to manage
 compactions if you wish to compact these smaller SSTables. For all these
 reasons it is generally advised to stay away from running compactions
 manually.

 Assuming that this is a production environment and you want to keep
 everything running as smoothly as possible I would reduce the gc_grace on
 the CF, allow automatic minor compactions to kick in and then increase the
 gc_grace once again after the tombstones have been removed.


 On Fri, Apr 11, 2014 at 3:44 PM, William Oberman 
 ober...@civicscience.com wrote:

 So, if I was impatient and just wanted to make this happen now, I
 could:

 1.) Change GCGraceSeconds of the CF to 0
 2.) run nodetool compact (*)
 3.) Change GCGraceSeconds of the CF back to 10 days

 Since I have ~900M tombstones, even if I miss a few due to impatience,
 I don't care *that* much as I could re-run my clean up tool against the
 now much smaller CF.

 (*) A long, long time ago I seem to recall reading advice that you should
 never run nodetool compact, but I can't remember why.  Is there any bad
 long-term consequence?  Short term there are several:
 -a heavy operation
 -temporary 2x disk space
 -one big SSTable afterwards
 But moving forward, everything is ok, right?
  CommitLog/MemTable -> SSTables, minor compactions that merge SSTables,
 etc...  The only flaw I can think of is it will take forever until the
 SSTable minor compactions build up enough to consider including the big
 SSTable in a compaction, making it likely I'll have to self-manage
 compactions.



 On Fri, Apr 11, 2014 at 10:31 AM, Mark Reddy 
 mark.re...@boxever.com wrote:

 Correct, a tombstone will only be removed after the gc_grace period has
 elapsed. The default value is set to 10 days, which allows a great deal of
 time for consistency to be achieved prior to deletion. If you are
 operationally confident that you can achieve consistency via anti-entropy
 repairs within a shorter period, you can always reduce that 10 day
 interval.


 Mark


 On Fri, Apr 11, 2014 at 3:16 PM, William Oberman 
 ober...@civicscience.com wrote:

 I'm seeing a lot of articles about a dependency between removing
 tombstones and GCGraceSeconds, which might be my problem (I just 
 checked,
 and this CF has GCGraceSeconds of 10 days).


 On Fri, Apr 11

Blog post with Cassandra upgrade tips

2014-04-11 Thread Paulo Ricardo Motta Gomes
Hey,

Some months ago (last year!!) during our previous major upgrade from 1.1 to
1.2 I started writing a blog post with some tips for a smooth rolling
upgrade, but for some reason I forgot to finish the post. I found it
recently and decided to publish it anyway, as some of the info may be
helpful for future major upgrades:

http://monkeys.chaordic.com.br/operation/zero-downtime-cassandra-upgrade/

Cheers,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200


Re: using hadoop + cassandra for CF mutations (delete)

2014-04-04 Thread Paulo Ricardo Motta Gomes
You said you have tried the Pig URL split_size, but have you actually tried
decreasing the value of the cassandra.input.split.size Hadoop property? The
default is 65536, so you may want to decrease it and see if the number of
mappers increases. At some point, though, lowering the value further stops
changing the number of mappers; I don't know exactly why, probably because
it hits the minimum number of rows per token.

Another suggestion is to decrease the number of simultaneous mappers of
your job so it doesn't hit cassandra too hard; you'll get fewer
TimedOutExceptions, but your job will take longer to complete.
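
For illustration, a rough sketch of what that could look like at the top of
a Pig script (the values are just starting points to experiment with, and
the keyspace/CF names are placeholders):

SET cassandra.input.split.size '8192';
SET mapred.map.max.attempts '8';
rows = LOAD 'cassandra://my_keyspace/my_cf'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage();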

On Fri, Apr 4, 2014 at 1:24 PM, William Oberman ober...@civicscience.com wrote:

 Hi,

 I have some history with cassandra + hadoop:
 1.) Single DC + integrated hadoop = Was ok until I needed steady
 performance (the single DC was used in a production environment)
 2.) Two DC's + integrated hadoop on 1 of 2 DCs = Was ok until my data
 grew and in AWS compute is expensive compared to data storage... e.g.
 running a 24x7 DC was a lot more expensive than the following solution...
 3.) Single DC + a constant ETL to S3 = Is still ok, I can spawn an
 arbitrarily large EMR cluster.  And 24x7 data storage + transient EMR is
 cost effective.

 But, one of my CF's has had a change of usage pattern making a large %,
 but not all of the data, fairly pointless to store.  I thought I'd write a
 Pig UDF that could peek at a row of data and delete if it fails my
 criteria.  And it works in terms of logic, but not in terms of practical
 execution.  The CF in question has O(billion) keys, and afterwards it will
 have ~10% of that at most.

 I basically keep losing the jobs due to too many task failures, all rooted
 in:
 Caused by: TimedOutException()
 at
 org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:13020)

 And yes, I've messed around with:
 -Number of failures for map/reduce/tracker (in the hadoop confs)
 -split_size (on the URL)
 -cassandra.range.batch.size

 But it hasn't helped.  My failsafe is to roll my own distributed process,
 rather than falling into a pit of internal hadoop settings.  But I feel
 like I'm close.

 The problem in my opinion, watching how things are going, is the
 correlation of splits -> tasks.  I'm obviously using Pig, so this part of
 the process is fairly opaque to me at the moment.  But, something
 somewhere is picking 20 tasks for my job, and this is fairly independent
 of the # of task slots (I've booted EMR clusters with different #'s and
 always get 20).  Why does this matter?  When a task fails, it retries from
 the start, which is a killer for me as I delete as I go, making that
 pointless work and massively increasing the odds of an overall job failure.
  If hadoop/pig chose a large number of tasks, the retries would be much
 less of a burden.  But, I don't see where/what lets me mess with that logic.

 Pig gives the ability to mess with reducers (PARALLEL), but I'm in the
 load path, which is all mappers.  I've never jumped to the lower, raw
 hadoop level before.  But, I'm worried that will be the "falling into a
 pit" issue...

 I'm using Cassandra 1.2.15.

 will




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200
+55 83 9690-1314


sstable partitioner converter tool

2014-03-20 Thread Paulo Ricardo Motta Gomes
Hello,

We wanted to migrate our data from a RandomPartitioner cluster to
a Murmur3Partitioner cluster via sstableloader, but it does not support
directly loading sstables to a cluster with a different partitioner.

We didn't find any tool that performs the conversion between sstables from
different partitioners, so we put together some C* code and built our own.
After the sstable conversion is done it's possible to bulk load the data
into the new cluster with sstableloader.
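
A typical invocation after converting would be something along these lines
(the seed addresses and path are placeholders; sstableloader expects the
directory path to end in keyspace/columnfamily):

sstableloader -d 10.0.0.1,10.0.0.2 /path/to/converted/my_keyspace/my_cf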

The tool supports sstables from C* 1.2 and 2.0 and is available on github,
so feel free to use it and contribute:
https://github.com/chaordic/sstableconverter

Cheers,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*


Re: Dead node seen as UP by replacement node

2014-03-14 Thread Paulo Ricardo Motta Gomes
Hmm, we considered that option but if the old node is assassinated, its
range will be assigned to a neighbor that doesn't have the data, which will
cause empty reads. What we did to solve the problem was a safe removal via
nodetool removenode deadNodeId, waiting some hours for neighbors to
stream that node's data, and then bootstrapping the replacement node.
However, this procedure takes double the time, because the data needs to be
streamed twice, which is not really optimal.
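
For reference, that safe-removal path is roughly (the host ID is a
placeholder, taken from the dead node's line in nodetool status):

nodetool removenode <host-id-of-dead-node>
nodetool removenode status   # repeat until no token removal is in progress
# then start the replacement node and let it bootstrap normally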

It would be really nice to know if this is expected behavior or if I should
file a bug report.


On Fri, Mar 14, 2014 at 11:59 AM, Rahul Menon ra...@apigee.com wrote:

 Since the older node is not available, I would ask you to assassinate the
 old node and then get the new node to bootstrap.


 On Thu, Mar 13, 2014 at 10:56 PM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 Yes, exactly.


 On Thu, Mar 13, 2014 at 1:27 PM, Rahul Menon ra...@apigee.com wrote:

 And the token value, as suggested, is (token value of dead node) - 1?


 On Thu, Mar 13, 2014 at 9:29 PM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 Nope, they have different IPs. I'm using the procedure described here
 to replace a dead node:
 http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node

 Dead node token: X (IP: Y)
 Replacement node token: X-1 (IP: Z)

 So, as soon as the replacement node (Z) is started, it sees the dead
 node (Y) as UP, and tries to stream data from it during the join process.
 About 10 minutes later, the failure detector of Z detects Y as down, but
 since Z was trying to fetch data from Y, it fails the join/bootstrap
 process altogether.





 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200
 +55 83 9690-1314





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200
+55 83 9690-1314


Cannot bootstrap replacement node

2014-03-14 Thread Paulo Ricardo Motta Gomes
Hello,

I'm having some trouble during bootstrap of a replacement node and I'm
suspecting it could be a bug in Cassandra. I'm using C* 1.2.13, RF=2, with
Vnodes disabled. Below is a simplified version of my ring:

* n1 : token 100
* n2 : token 200 (DEAD)
* n3 : token 300
* n4 : token 0

n2 has died, so I tried bootstrapping a new replacement node:

* x : token 199 (n2.token-1)

Even though n2 was terminated and was seen as DOWN by n1, n3 and n4, the
replacement node x saw n2 as UP and immediately tried to stream data from
it during bootstrap. After about 10 minutes, when x finally detected n2 as
DOWN, the bootstrap failed for obvious reasons.

Since the previous procedure did not work, I tried the next procedure for
replacing n2:

- Remove n2 from the ring. This makes n3 stream n2's data to n1.
- After the leave is complete, try to bootstrap X again.

Ideally, x would stream data from n1 and n3, but it always streams data
only from n3. The problem is that at some point n3 is seen as DOWN by x,
failing the bootstrap process again.

I suspect there is some kind of inconsistency in the gossip information of
n2 that is preventing x from streaming data from both n1 and n3. I tried
purging n2 from gossip, using Gossiper.unsafeAssassinateEndpoint() via JMX,
but I'm getting the following error:

*Problem invoking unsafeAssassinateEndpoint :
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0*

My next and last approach is to manually copy the sstables via rsync from
n3 and start x with auto_bootstrap=false, but I'd really rather not use
this approach. Is it really this hard to bootstrap a new node when not
using vnodes in C* 1.2, or could this be hiding some kind of bug? Any
feedback would be greatly appreciated.
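
For concreteness, that last-resort path would look roughly like this (the
paths assume the default data directory layout and are placeholders):

# on x, copy the relevant keyspace data from n3:
rsync -av n3:/var/lib/cassandra/data/my_keyspace/ /var/lib/cassandra/data/my_keyspace/
# then, in x's cassandra.yaml, before starting it:
auto_bootstrap: false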

Cheers,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*


Re: Dead node seen as UP by replacement node

2014-03-13 Thread Paulo Ricardo Motta Gomes
Yes, exactly.


On Thu, Mar 13, 2014 at 1:27 PM, Rahul Menon ra...@apigee.com wrote:

 And the token value, as suggested, is (token value of dead node) - 1?


 On Thu, Mar 13, 2014 at 9:29 PM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 Nope, they have different IPs. I'm using the procedure described here to
 replace a dead node:
 http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node

 Dead node token: X (IP: Y)
 Replacement node token: X-1 (IP: Z)

 So, as soon as the replacement node (Z) is started, it sees the dead node
 (Y) as UP, and tries to stream data from it during the join process. About
 10 minutes later, the failure detector of Z detects Y as down, but since Z
 was trying to fetch data from Y, it fails the join/bootstrap process
 altogether.





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200
+55 83 9690-1314


Dead node seen as UP by replacement node

2014-03-12 Thread Paulo Ricardo Motta Gomes
Hello,

I'm trying to replace a dead node using the procedure in [1], but the
replacement node initially sees the dead node as UP, and after a few
minutes the node is marked as DOWN again, failing the streaming/bootstrap
procedure of the replacement node. This dead node is always seen as DOWN by
the rest of the cluster.

Could this be a bug? I can easily reproduce it in our production
environment, but don't know if it's reproducible in a clean environment.

Version: 1.2.13

Here is the log from the replacement node (192.168.1.10 is the dead node):

 INFO [GossipStage:1] 2014-03-12 20:25:41,089 Gossiper.java (line 843) Node
/192.168.1.10 is now part of the cluster
 INFO [GossipStage:1] 2014-03-12 20:25:41,090 Gossiper.java (line 809)
InetAddress /192.168.1.10 is now UP
 INFO [GossipTasks:1] 2014-03-12 20:34:54,238 Gossiper.java (line 823)
InetAddress /192.168.1.10 is now DOWN
ERROR [GossipTasks:1] 2014-03-12 20:34:54,240 AbstractStreamSession.java
(line 110) Stream failed because /192.168.1.10 died or was
restarted/removed (streams may still be active in background, but further
streams won't be started)
 WARN [GossipTasks:1] 2014-03-12 20:34:54,240 RangeStreamer.java (line 246)
Streaming from /192.168.1.10 failed
ERROR [GossipTasks:1] 2014-03-12 20:34:54,240 AbstractStreamSession.java
(line 110) Stream failed because /192.168.1.10 died or was
restarted/removed (streams may still be active in background, but further
streams won't be started)
 WARN [GossipTasks:1] 2014-03-12 20:34:54,241 RangeStreamer.java (line 246)
Streaming from /192.168.1.10 failed

[1]
http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node

Cheers,

Paulo

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200
+55 83 9690-1314


Re: Dead node seen as UP by replacement node

2014-03-12 Thread Paulo Ricardo Motta Gomes
Some further info:

I'm not using vnodes, so I'm using the 1.1 replace-node trick of setting
the initial_token in the cassandra.yaml file to the value of the dead
node's token minus 1, with auto_bootstrap=true. However, according to the
Apache wiki (
https://wiki.apache.org/cassandra/Operations#For_versions_1.2.0_and_above),
on 1.2 you should actually remove the dead node from the ring before
adding a replacement node.

Does that mean the trick of setting the initial_token to the value of the
dead node's token minus 1 (described in
http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node)
is no longer valid in 1.2 without vnodes?
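
In yaml terms, that 1.1-style trick boils down to the following on the
replacement node (199 stands for "dead node's token minus one" and is only
an illustration; auto_bootstrap is not in the default yaml but is a valid
option that defaults to true):

initial_token: 199
auto_bootstrap: true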


On Wed, Mar 12, 2014 at 5:57 PM, Paulo Ricardo Motta Gomes 
paulo.mo...@chaordicsystems.com wrote:

 Hello,

 I'm trying to replace a dead node using the procedure in [1], but the
 replacement node initially sees the dead node as UP, and after a few
 minutes the node is marked as DOWN again, failing the streaming/bootstrap
 procedure of the replacement node. This dead node is always seen as DOWN by
 the rest of the cluster.

 Could this be a bug? I can easily reproduce it in our production
 environment, but don't know if it's reproducible in a clean environment.

 Version: 1.2.13

 Here is the log from the replacement node (192.168.1.10 is the dead node):

  INFO [GossipStage:1] 2014-03-12 20:25:41,089 Gossiper.java (line 843)
 Node /192.168.1.10 is now part of the cluster
  INFO [GossipStage:1] 2014-03-12 20:25:41,090 Gossiper.java (line 809)
 InetAddress /192.168.1.10 is now UP
  INFO [GossipTasks:1] 2014-03-12 20:34:54,238 Gossiper.java (line 823)
 InetAddress /192.168.1.10 is now DOWN
 ERROR [GossipTasks:1] 2014-03-12 20:34:54,240 AbstractStreamSession.java
 (line 110) Stream failed because /192.168.1.10 died or was
 restarted/removed (streams may still be active in background, but further
 streams won't be started)
  WARN [GossipTasks:1] 2014-03-12 20:34:54,240 RangeStreamer.java (line
 246) Streaming from /192.168.1.10 failed
 ERROR [GossipTasks:1] 2014-03-12 20:34:54,240 AbstractStreamSession.java
 (line 110) Stream failed because /192.168.1.10 died or was
 restarted/removed (streams may still be active in background, but further
 streams won't be started)
  WARN [GossipTasks:1] 2014-03-12 20:34:54,241 RangeStreamer.java (line
 246) Streaming from /192.168.1.10 failed

 [1]
 http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node

 Cheers,

 Paulo

 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200
 +55 83 9690-1314




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200
+55 83 9690-1314


Re: Bootstrap failure on C* 1.2.13

2014-02-14 Thread Paulo Ricardo Motta Gomes
Hello Alain,

I solved this with a brute force solution, but didn't understand exactly
what happened behind the scenes. What I did was:

a) removed the failed node from the ring with the unsafeAssassinate JMX
option.
b) this caused requests to that node to be routed to the following node,
which didn't have the data, so in order to fix the problem I inserted a new
dummy node with the same token as the failed node, but with
auto_bootstrap=false
c) after the node joined the ring again, I did a clean shutdown with
nodetool -h localhost disablethrift   # stop accepting client connections
nodetool -h localhost disablegossip && sleep 10   # leave the ring, let gossip settle
nodetool -h localhost drain   # flush memtables and stop accepting writes
d) restart the bootstrap process again in the new node.

But in our case, our cluster was not using VNodes, so this workaround will
probably not work with VNodes, since you cannot specify the 256 tokens from
the old node.

This really seems like some kind of metadata inconsistency in gossip, so
you should probably check whether nodetool gossipinfo shows a node that's
not supposed to be in the ring and unsafeAssassinate it. This post has more
info about it: http://nartax.com/2012/09/assassinate-cassandra-node/
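
For the record, one way to invoke it without a GUI is a JMX command-line
client such as jmxterm; roughly (the jar name and the IP are placeholders,
and 7199 is the default JMX port):

java -jar jmxterm.jar -l localhost:7199
$> bean org.apache.cassandra.net:type=Gossiper
$> run unsafeAssassinateEndpoint 192.168.1.10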

But be careful and make sure you know what you're doing, as this can be a
dangerous operation.

Good luck!

Cheers,

Paulo




On Fri, Feb 14, 2014 at 11:17 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 Hi Paulo,

 Did you find out how to fix this issue? I am experiencing the exact same
 issue after trying to help you on this exact subject a few days ago :).

 Config : 32 C*1.2.11 nodes, Vnodes enabled, RF=3, 1 DC, On AWS EC2
 m1.xlarge.

 We added a few nodes (4) and it seems that this occurs on one node out of
 two...

 INFO 12:52:16,889 Finished streaming session
 d5e4d014-9558-11e3-950d-cd6aba92807e from /xxx.xxx.xxx.xxx
 java.lang.RuntimeException: Unable to fetch range
 [(20078703525355016727168231761171377180,20105424945623564908585534414693308183],
 (129753652951782325468767616123724624016,129754698153613057562227134647005586420],
 (449910615740630024413140540076738,4524540663392564361402125588359485564],
 (122461441134035840782923349842361962551,122462803389597917496737056756119104930],
 (107970238065835199457922160357012606207,107987706615224138615506976884972465320],
 (129754698153613057562227134647005586420,129760990520285412763184172827801136526],
 (38338043252657275110873170917842646549,38368318768493907804399955985800320618],
 (42022774431506526693485667522039962965,42053289032932587102300879230918436885],
 (66836265760288088017242608238099612345,66844191330959602627129212011239690831],
 (52540232739182066369547232798226785314,52559117354438503565212218200939569114],
 (145046787539667961591986998676504957238,145057153206926436867917708334845130444],
 (108279691586280658015556401795266720050,108305470056478513440634738885678702409],
 (40039571254531814244837067525035822613,40053379084508254942645157728035688263],
 (132027653159543236812527609067336099062,132029648290617316887203744857701890860],
 (52516518106546460227349801041398186304,52540232739182066369547232798226785314],
 (151797253868519929321029931533765036527,151828244658375264200603444399788004805],
 (145057153206926436867917708334845130444,145084033851007428646660791831082771964],
 (107963567982152736714636832273817259428,107970238065835199457922160357012606207]]
 for keyspace foo_bar from any hosts

 at org.apache.cassandra.dht.RangeStreamer.fetch(RangeStreamer.java:260)
 at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:84)
 at
 org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:973)
 at
 org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:740)
 at
 org.apache.cassandra.service.StorageService.initServer(StorageService.java:584)
 at
 org.apache.cassandra.service.StorageService.initServer(StorageService.java:481)
 at
 org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
 at
 org.apache.cassandra.service.CassandraDaemon.init(CassandraDaemon.java:381)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at
 org.apache.commons.daemon.support.DaemonLoader.load(DaemonLoader.java:212)

 Cannot load daemon

 Service exit with a return value of 3

 Hope you'll be able to help me on this one :)


 2014-02-07 19:24 GMT+01:00 Robert Coli rc...@eventbrite.com:

 On Fri, Feb 7, 2014 at 4:41 AM, Alain RODRIGUEZ arodr...@gmail.comwrote:

 From changelog :



 1.2.15
  * Move handling of migration event source to solve bootstrap race 
 (CASSANDRA-6648)

 Maybe you should give this new version a try, if you suspect your issue to
 be related to CASSANDRA-6648.

 6648 appears to have been introduced in 1.2.14, by :

 https://issues.apache.org/jira/browse/CASSANDRA-6615

 So it should only affect 1.2.14.

 =Rob






non-vnodes own 0.0% of the ring on nodetool status

2014-02-12 Thread Paulo Ricardo Motta Gomes
Hello,

After adding a new datacenter with virtual nodes enabled, the output of
nodetool status shows that the nodes from the non-vnodes datacenter own
0.0% of the data, as shown below:

Datacenter: NonVnodesDC
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load     Tokens  Owns  Host ID  Rack
UN  XX.XXX.XX.XX  many GB  1       0.1%           myrack
UN  YY.YYY.YY.YY  many GB  1       0.0%           myrack
UN  ZZ.ZZZ.ZZ.ZZ  many GB  1       0.0%           myrack

Datacenter: VnodesDC
====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load    Tokens  Owns  Host ID  Rack
UN  AA.AAA.AA.AA  few KB  256     5.8%           myrack
UN  BB.BBB.BB.BB  few KB  256     6.6%           myrack
UN  CC.CCC.CC.CC  few KB  256     6.9%           myrack


Is this a presentation issue in nodetool, or could it mean something more
serious? I followed exactly the procedure described in
http://www.datastax.com/documentation/cassandra/1.2/webhelp/cassandra/operations/ops_add_dc_to_cluster_t.html
to add the new DC.

Thank you,

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200
+55 83 9690-1314


Re: Question: ConsistencyLevel.ONE with multiple datacenters

2014-02-06 Thread Paulo Ricardo Motta Gomes
Cool. I actually changed the consistency level to LOCAL_ONE and things
worked as expected.
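
In Hector, that amounts to something like the snippet below (cluster name,
host and keyspace are placeholders, and this assumes your Hector build
already exposes HConsistencyLevel.LOCAL_ONE):

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

// default all reads to LOCAL_ONE so they never leave the local DC
ConfigurableConsistencyLevel ccl = new ConfigurableConsistencyLevel();
ccl.setDefaultReadConsistencyLevel(HConsistencyLevel.LOCAL_ONE);
Keyspace ks = HFactory.createKeyspace("my_keyspace",
    HFactory.getOrCreateCluster("my-cluster", "host1:9160"), ccl);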

Cheers!


On Thu, Feb 6, 2014 at 11:31 AM, Chris Burroughs
chris.burrou...@gmail.com wrote:

 I think the scenario you outlined is correct.  The DES handles multiple
 DCs poorly and the LOCAL_ONE hammer is the best bet.


 On 01/31/2014 12:40 PM, Paulo Ricardo Motta Gomes wrote:

 Hey,

  When adding a new data center to our production C* cluster using the
  procedure described in [1], some of our application requests were
  returning null/empty values. Rebuild was not complete in the new
  datacenter, so my guess is that some requests were being directed to the
  brand new datacenter which still didn't have the data.

 Our Hector client was connected only to the original nodes, with
 autoDiscoverHosts=false and we use ConsistencyLevel.ONE for reads. The
 keyspace schema was already configured to use both data centers.

 My question is: is it possible that the dynamic snitch is choosing the
 nodes in the new (empty) datacenter when CL=ONE? In this case, it's
  mandatory to use CL=LOCAL_ONE during bootstrap/rebuild of a new
  datacenter, otherwise empty data might be returned, correct?

 Cheers,

 [1]
  http://www.datastax.com/documentation/cassandra/1.2/webhelp/cassandra/operations/ops_add_dc_to_cluster_t.html





-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200
+55 83 9690-1314


Re: Adding datacenter for move to vnodes

2014-02-02 Thread Paulo Ricardo Motta Gomes
We had a similar situation and what we did was first migrate the 1.1
cluster to GossipingPropertyFileSnitch, making sure that for each node we
specified the correct availability zone as the rack in
cassandra-rackdc.properties. In this way, the GossipingPropertyFileSnitch
is equivalent to the EC2MultiRegionSnitch, so the data location does not
change and no repair is needed afterwards. So, if your nodes are located in
the us-east-1e AZ, your cassandra-rackdc.properties should look like:

dc=us-east
rack=1e

After this step is complete on all nodes, you can add a new datacenter,
specifying a different dc and rack in the cassandra-rackdc.properties of
the new DC. Make sure you upgrade your initial datacenter to 1.2 before
adding a new datacenter with vnodes enabled (of course).
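
On the nodes of the new datacenter the file then carries its own names, for
example (the dc value below is just illustrative):

dc=us-east-vnodes
rack=1e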

Cheers


On Sun, Feb 2, 2014 at 6:37 AM, Katriel Traum katr...@google.com wrote:

 Hello list.

 I'm upgrading a 1.1 cassandra cluster to 1.2(.13).
 I've read here and in other places that the best way to migrate to vnodes
 is to add a new DC, with the same amount of nodes, and run rebuild on each
 of them.
 However, I'm faced with the fact that I'm using EC2MultiRegion snitch,
 which automagically creates the DC and RACK.

 Any ideas how I can go about adding a new DC with this kind of setup? I
 need these new machines to be in the same EC2 Region as the current ones,
 so adding to a new Region is not an option.

 TIA,
 Katriel




-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200
+55 83 9690-1314


Question: ConsistencyLevel.ONE with multiple datacenters

2014-01-31 Thread Paulo Ricardo Motta Gomes
Hey,

When adding a new data center to our production C* cluster using the
procedure described in [1], some of our application requests were returning
null/empty values. Rebuild was not complete in the new datacenter, so my
guess is that some requests were being directed to the brand new datacenter
which still didn't have the data.

Our Hector client was connected only to the original nodes, with
autoDiscoverHosts=false and we use ConsistencyLevel.ONE for reads. The
keyspace schema was already configured to use both data centers.

My question is: is it possible that the dynamic snitch is choosing the
nodes in the new (empty) datacenter when CL=ONE? In this case, it's
mandatory to use CL=LOCAL_ONE during bootstrap/rebuild of a new datacenter,
otherwise empty data might be returned, correct?

Cheers,

[1]
http://www.datastax.com/documentation/cassandra/1.2/webhelp/cassandra/operations/ops_add_dc_to_cluster_t.html

-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br http://www.chaordic.com.br/*
+55 48 3232.3200
+55 83 9690-1314