Re: General question regarding bootstrap and nodetool repair

2013-01-31 Thread Rob Coli
On Thu, Jan 31, 2013 at 12:19 PM, Wei Zhu wz1...@yahoo.com wrote:
 But I am still not sure about my first question regarding the
 bootstrap, anyone?

As I understand it, bootstrap occurs from a single replica. Which
replica is chosen is based on some internal estimation of which is
closest/least loaded/etc. But only from a single replica, so in RF=3,
in order to be consistent with both you still have to run a repair.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: initial_token

2013-01-31 Thread Rob Coli
On Thu, Jan 31, 2013 at 12:17 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
 Now by default a new partitioner is chosen, Murmur3.

Now = as of 1.2, to be unambiguous.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: General question regarding bootstrap and nodetool repair

2013-01-31 Thread Rob Coli
On Thu, Jan 31, 2013 at 3:31 PM, Wei Zhu wz1...@yahoo.com wrote:
 The only reason I can think of is that the new node has the same IP as the
 dead node we tried to replace? After reading the bootstrap code, it
 shouldn't be the case. Is it a bug? Or anyone tried to replace a dead node
 with the same IP?

You can use the replace_token property to accomplish this. I would expect
cassandra to get confused by having two nodes with the same ip.
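
Roughly (untested; assumes a package-style install where JVM options live
in cassandra-env.sh, and <token> is the dead node's token) :

# on the replacement node, before its first start :
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_token=<token>"
# ...then start cassandra, let it finish bootstrapping, and remove the
# option again so it isn't applied on later restarts.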

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Suggestion: Move some threads to the client-dev mailing list

2013-01-30 Thread Rob Coli
On Wed, Jan 30, 2013 at 7:21 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
 My suggestion: At minimum we should re-route these questions to client-dev
 or simply say, If it is not part of core Cassandra, you are looking in the
 wrong place for support

+1, I find myself scanning past all those questions in order to find
questions I am able to answer based solely on my operational knowledge
of the Cassandra daemon.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Cassandra timeout whereas it is not much busy

2013-01-22 Thread Rob Coli
On Wed, Jan 16, 2013 at 1:30 PM, Nicolas Lalevée
nicolas.lale...@hibnet.org wrote:
 Here is the long story.
 After some long useless staring at the monitoring graphs, I gave a try to
 using the openjdk 6b24 rather than openjdk 7u9

OpenJDK 6 and 7 are both counter-recommended with regards to
Cassandra. I've heard reports of mysterious behavior like the behavior
you describe, when using OpenJDK 7.

Try using the Sun/Oracle JVM? Is your JNA working?

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: node down = log explosion?

2013-01-22 Thread Rob Coli
On Tue, Jan 22, 2013 at 5:03 AM, Sergey Olefir solf.li...@gmail.com wrote:
 I am load-testing counter increments at the rate of about 10k per second.

Do you need highly performant counters that count accurately, without
meaningful chance of over-count? If so, Cassandra's counters are
probably not ideal.

 We wanted to test what happens if one node goes down, so we brought one node
 down in DC1 (i.e. the node that was handling half of the incoming writes).
 ...
 This led to a complete explosion of logs on the remaining alive node in DC1.

I agree, this level of exception logging during replicateOnWrite
(which is called every time a counter is incremented) seems like a
bug. I would file a bug at the Apache JIRA.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: node down = log explosion?

2013-01-22 Thread Rob Coli
On Tue, Jan 22, 2013 at 2:57 PM, Sergey Olefir solf.li...@gmail.com wrote:
 Do you have a suggestion as to what could be a better fit for counters?
 Something that can also replicate across DCs and survive link breakdown
 between nodes (across DCs)? (and no, I don't need 100.00% precision
 (although it would be nice obviously), I just need to be pretty close for
 the values of pretty)

In that case, Cassandra counters are probably fine.

 On the subject of bug report -- I probably will -- but I'll wait a bit for
 more info here, perhaps there's some configuration or something that I just
 don't know about.

Throwing exceptions on the replicateOnWrite stage seems pretty unambiguous
to me, and unexpected. YMMV?

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: is there a way to list who is connected to my cluster?

2013-01-11 Thread Rob Coli
On Fri, Jan 11, 2013 at 10:32 AM, Brian Tarbox tar...@cabotresearch.com wrote:
 I'd like to be able to find out which processes are connected to my
 cluster... is there a way to do this?

No, not internally to Cassandra, short of enabling DEBUG logging for
associated classes. Use netstat or similar.
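
For example, on a node using the default thrift rpc_port (9160),
something like :

netstat -tn | grep ':9160 ' | grep ESTABLISHED | awk '{print $5}' | cut -d: -f1 | sort | uniq -c

...gives a per-client-ip count of established connections.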

If you are interested in such a feature, please log into Cassandra's
JIRA and vote for this issue :

https://issues.apache.org/jira/browse/CASSANDRA-5084

Cassandra should expose connected client state via JMX

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Script to load sstables from v1.0.x to v 1.1.x

2013-01-08 Thread Rob Coli
On Tue, Jan 8, 2013 at 8:41 AM, Todd Nine todd.n...@gmail.com wrote:
   I have recently been trying to restore backups from a v1.0.x cluster we
 have into a 1.1.7 cluster.  This has not been as trivial as I expected, and
 I've had a lot of help from the IRC channel in tackling this problem.  As a
 way of saying thanks, I'd like to contribute the updated ruby script I was
 originally given for accomplishing this task.  Here it is.

While I laud your contribution, I am still not fully understanding why
this is not working automagically, as it should :

http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-1-flexible-data-file-placement

What about upgrading?

Do you need to manually move all pre-1.1 data files to the new
directory structure before upgrading to 1.1? No. Immediately after
Cassandra 1.1 starts, it checks to see whether it has old directory
structure and migrates all data files (including backups and
snapshots) to the new directory structure if needed. So, just upgrade
as you always do (don’t forget to read NEWS.txt first), and you will
get more control over data files for free.


Is it possible that, for example, the installation of the debian
package results in your 1.1.x node starting up before you intend it
to.. and then when you start it again with the 1.0 paths, it doesn't
try to change the paths?

 * To check if sstables needs migration, we look at the System
directory. If it contains a directory for the status cf, we'll attempt
a sstable migrating. 

This quote from Directories.java (thx driftx!) suggests that any
starting of a 1.1 node, which would result in a Status columnfamily
being created, would make sstablesNeedsMigration return false.

If this is your case due to the use of the debian package or similar
which auto-starts, your input is welcomed at :

https://issues.apache.org/jira/browse/CASSANDRA-2356
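
If the debian package's auto-start is the culprit, one (hypothetical,
untested here) way to keep dpkg from starting the daemon during an
install or upgrade is a policy-rc.d shim :

# tell invoke-rc.d not to start any service for the duration
printf '#!/bin/sh\nexit 101\n' | sudo tee /usr/sbin/policy-rc.d
sudo chmod +x /usr/sbin/policy-rc.d
sudo apt-get install cassandra
sudo rm /usr/sbin/policy-rc.d   # remove the shim when you are done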

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Script to load sstables from v1.0.x to v 1.1.x

2013-01-08 Thread Rob Coli
On Tue, Jan 8, 2013 at 11:56 AM, Todd Nine todd.n...@gmail.com wrote:
 Our current production
 cluster is still on 1.0.x, so we can either fire up a 1.0.x cluster, then
 upgrade every node to accomplish this, or just use the script.

No 1.0 cluster is required to restore 1.0 directory structure to a 1.1
cluster and have the tables be migrated by Cassandra. The 1.1 node
should look at the 1.0 directory structure you just restored and
migrate it automagically.
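
For reference, the two layouts differ roughly like this (paths and
filenames are illustrative, assuming the default data directory) :

# 1.0.x : sstables live directly in the keyspace directory
/var/lib/cassandra/data/MyKeyspace/MyCF-hc-1-Data.db
# 1.1.x : each columnfamily gets its own subdirectory
/var/lib/cassandra/data/MyKeyspace/MyCF/MyKeyspace-MyCF-hc-1-Data.db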

 We also have
 a different number of nodes in stage vs production, so we'd still need to
 run a repair if we did a straight sstable copy.

This is a compelling reason to bulk load. My commentary merely points
out that if you *aren't* changing cluster size/topology, Cassandra 1.1
should be migrating the sstables for you. :)

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: replace_token versus nodetool repair

2013-01-07 Thread Rob Coli
On Mon, Jan 7, 2013 at 9:05 AM, DE VITO Dominique
dominique.dev...@thalesgroup.com wrote:
 Is nodetool repair only usable if the node to repair has a valid (= 
 up-to-date with its neighbors) schema?

If the node is in the cluster, it should have the correct schema. If
it doesn't have the correct schema, you should either wait until the
schema is received, or (if it's stuck) wipe the schema on that node
and re-join the node.

 If the data records are completely broken on a node with token, is it valid 
 to clean the (data) records and to execute replace_token=token on the 
 *same* node?

Yes.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Inter-Cluster Communication

2013-01-02 Thread Rob Coli
On Wed, Jan 2, 2013 at 4:33 AM, Everton Lima peitin.inu...@gmail.com wrote:
 I would like to know if it is possible to create 2 clusters, the first
 containing just meta-data and the second just the real data. How will the
 system communicate with both clusters, and how will one cluster communicate
 with the other? Could anyone help me?

Cassandra does not have a mechanism for clusters to talk to each
other. Your application could talk to both clusters, but they would be
independent of each other.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Very large HintsColumnFamily

2012-12-21 Thread Rob Coli
Before we start.. what version of cassandra?

On Fri, Dec 21, 2012 at 4:25 PM, Keith Wright kwri...@nanigans.com wrote:
 This behavior seems to occur if I do a large
 amount of data loading using that node as the coordinator node.

In general you want to use all nodes to coordinate, not a single one.

 Nodetool netstats never seems to show
 any streaming data.  With past nodes it seemed like the node eventually
 fixed itself.

That node is storing hints for other nodes it believes are or were at
some point in DOWN state. The first step to preventing this problem
from recurring is to understand why it believes (or believed) other nodes
are down. My conjecture is that you are overloading the coordinating node
and/or other nodes with the large amount of writes.

 Note that I am seeing severely degraded performance on this node when it
 attempts to compact the HintsColumnFamily to the point where I had to set
 setcompactionthroughput to 999 to ensure it doesn't run again (after which
 the node started serving requests much faster).

Depending on version, your 40gb of hints could be in one 40gb wide
row. Look at nodetool cfstats for HintsColumnFamily to determine if
this is the case.
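
Something like :

nodetool -h localhost cfstats | grep -A 20 'Column Family: HintsColumnFamily'

...and look at "Compacted row maximum size" to see whether one enormous
row dominates.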

Do you see "Timed out replaying hint" messages, or are the hints being
successfully delivered?

You have two broad options :

1) purge your hints and then either reload the data (if reloading it
will be idempotent) or repair -pr on every node in the cluster.
2) reduce load enough that hints will be successfully delivered,
reduce gc_grace_seconds on the hints cf to 0 and then do a major
compaction.

If I were you, I would probably do 1). The easiest way is to stop the
node and remove all sstables in the HintsColumnFamily.
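
For 1), roughly (untested; the path assumes the default data directory,
and the exact layout depends on your version) :

nodetool -h localhost drain
# stop the cassandra process, then :
rm -r /var/lib/cassandra/data/system/HintsColumnFamily*
# start cassandra, then either reload the data or repair -pr on every node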

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: question about config leading to an unbalanced ring

2012-12-20 Thread Rob Coli
On Thu, Dec 20, 2012 at 10:18 AM, DE VITO Dominique
dominique.dev...@thalesgroup.com wrote:
 With RF=3 and NetworkTopologyStrategy, The first replica per data center is 
 placed according to the partitioner (same as with SimpleStrategy). Additional 
 replicas in the same data center are then determined by walking the ring 
 clockwise until a node in a different rack from the previous replica is 
 found.

 So, if I understand correctly the data of rack1's 5 nodes will be replicated 
 on the single node of rack2.
 And then, the node of rack1 will host all the data of the cluster.

https://issues.apache.org/jira/browse/CASSANDRA-3810

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Monitoring the number of client connections

2012-12-20 Thread Rob Coli
On Wed, Dec 19, 2012 at 4:20 PM, Hiller, Dean dean.hil...@nrel.gov wrote:
 What? I thought cassandra was using nio so thread per connection is not 
 true?

Here's the monkey test I used to verify my conjecture.

1) ps -eLf |grep jsvc |grep cassandra | wc -l # note number of threads
2) for name in {1..300}; do cassandra-cli -h `hostname` -k validkeyspace & done  # open 300 client connections in the background
3) ps -eLf |grep jsvc | grep cassandra | wc -l # note much higher
number of threads
4) for name in {1..300}; do kill %$name; done  # kill the background clients
5) ps -eLf |grep jsvc | grep cassandra | wc -l # note that thread
count drops like a rock as connections are GCed

Via aaron_morton, here's the relevant chunk of cassandra.yaml :

https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L347

# sync  - One thread per thrift connection.
..
# hsha  - Stands for half synchronous, half asynchronous.
..
# The default is sync because on Windows hsha is about 30% slower.  On Linux,
# sync/hsha performance is about the same, with hsha of course using
# less memory.


So, by default Cassandra does in fact use one thread per thrift connection.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Monitoring the number of client connections

2012-12-20 Thread Rob Coli
On Thu, Dec 20, 2012 at 12:41 PM, Rob Coli rc...@palominodb.com wrote:
 So, by default Cassandra does in fact use one thread per thrift connection.

Also of note is that even with hsha, an *active* connection (where
synchronous storage backend is doing something) consumes a thread.
Some more background at :
https://issues.apache.org/jira/browse/CASSANDRA-4277

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: State of Cassandra and Java 7

2012-12-13 Thread Rob Coli
On Thu, Dec 13, 2012 at 11:43 AM, Drew Kutcharian d...@venarc.com wrote:
 With Java 6 being EOL-ed soon 
 (https://blogs.oracle.com/java/entry/end_of_public_updates_for), what's the 
 status of Cassandra's Java 7 support? Anyone using it in production? Any 
 outstanding *known* issues?

I'd love to see an official statement from the project, due to the
sort of EOL issues you're referring to. Unfortunately previous
requests on this list for such a statement have gone unanswered.

The non-official response is that various people run in production
with Java 7 and it seems to work. :)

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Help on MMap of SSTables

2012-12-10 Thread Rob Coli
On Thu, Dec 6, 2012 at 7:36 PM, aaron morton aa...@thelastpickle.com wrote:
 So for memory mapped files, compaction can do a madvise SEQUENTIAL instead
 of current DONTNEED flag after detecting appropriate OS versions. Will this
 help?


 AFAIK Compaction does use memory mapped file access.

The history :

https://issues.apache.org/jira/browse/CASSANDRA-1470

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: reversed=true for CQL 3

2012-12-07 Thread Rob Coli
On Thu, Dec 6, 2012 at 5:26 PM, Shahryar Sedghi shsed...@gmail.com wrote:
 I am on 1.1.4 now (I can go to 1.1.6 if needed)  and apparently it is
 broken. I defined the table like this:

In general people on 1.1.x below 1.1.6 should upgrade to at least
1.1.6 ASAP, because all versions of 1.1.x before 1.1.6 have broken
Hinted Handoff.

 please let me know if I need to open a bug.

I unfortunately don't know whether what you are doing should work in
1.1.4 or not. If I were you, I'd give 1.1.6/7 a shot, and if it still
doesn't work probably file a bug. :)

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: What is substituting keys_cached column family argument

2012-12-05 Thread Rob Coli
On Wed, Dec 5, 2012 at 9:06 AM, Roman Yankin ro...@cognitivematch.com wrote:
 In Cassandra v 0.7 there was a column family property called keys_cached, now 
 it's gone and I'm struggling to understand which of the below properties it's 
 now substituted (if substituted at all)?

Key and row caches are global in modern cassandra. You opt CFs out of
the key cache, not opt in, because the default setting is keys_only
on a per-CF basis.

http://www.datastax.com/docs/1.1/configuration/node_configuration#row-cache-keys-to-save

http://www.datastax.com/docs/1.1/configuration/node_configuration#key-cache-keys-to-save

http://www.datastax.com/docs/1.1/configuration/storage_configuration#caching
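
For example, with 1.1-era settings (values are illustrative) :

# cassandra.yaml : global cache sizing
key_cache_size_in_mb: 100
row_cache_size_in_mb: 0

# per-CF opt-out of the key cache, via cassandra-cli :
update column family MyCF with caching = 'none';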

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Rename cluster

2012-11-29 Thread Rob Coli
On Thu, Nov 29, 2012 at 11:56 AM, Wei Zhu wz1...@yahoo.com wrote:
 I am trying to rename a cluster by following the instruction on Wiki:
 [...]
 I have to remove the data directory in order to change the cluster name.
 Luckily it's my testing box, so no harm. Just wondering what has been
 changed not to allow the modification through cli? What is the way of
 changing the cluster name without wiping out all the data now?

The instructions on the wiki are wrong, because clients are no longer
able to update the system keyspace. The ticket below is an
expansion of the previous behavior, which must mean the behavior in
question dates to 0.8 or early 1.0.

https://issues.apache.org/jira/browse/CASSANDRA-3759

You don't need to remove the data directory to change the cluster
name, you only need to remove the contents of the system keyspace's
LocationInfo columnfamily, which is where Cluster Name is stored.

Full, safe, offline process to rename a cluster :
1) put new cluster name in conf files
2) stop cluster (including draining and removing commitlog if you are
in a version below 1.1.6 and cannot rely on drain to prevent log
replay)
3) move LocationInfo files out of system keyspace of all nodes
4) start cluster

I do not believe there is currently any online way to do the above
operation. The check which prevents you from writing to the system
keyspace is a keyspace level check.
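
A rough shell sketch of steps 1-4 (untested; paths assume the default
/var/lib/cassandra layout) :

# 1) edit cluster_name in cassandra.yaml on every node
# 2) on every node :
nodetool -h localhost drain
# ...then stop the cassandra process
# 3) move the LocationInfo sstables aside :
mkdir -p /var/lib/cassandra/locationinfo.bak
mv /var/lib/cassandra/data/system/LocationInfo* /var/lib/cassandra/locationinfo.bak/
# 4) start cassandra on every node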

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: counters + replication = awful performance?

2012-11-28 Thread Rob Coli
On Tue, Nov 27, 2012 at 3:21 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
 I misspoke really. It is not dangerous, you just have to understand what it
 means. this jira discusses it.

 https://issues.apache.org/jira/browse/CASSANDRA-3868

Per Sylvain on the referenced ticket :


I don't disagree about the efficiency of the valve, but at what price?
'Bootstrapping a node will make you lose increments (you don't know
which ones, you don't know how many and this even if nothing goes
wrong)' is a pretty bad drawback. That is pretty much why that option
makes me uncomfortable: it does give you better performance, so people
may be tempted to use it. Now if it was only a matter of replicating
writes only through read-repair/repair, then ok, it's pretty dangerous
but it's rather easy to explain/understand the drawback (if you don't
lose a disk, you don't lose increments, and you'd better use CL.ALL or
have read_repair_chance to 1). But the fact that it doesn't work with
bootstrap/move makes me wonder if having the option at all is not
making a disservice to users.


To me, anything that can be described as "will make you lose increments
(you don't know which ones, you don't know how many and this even if
nothing goes wrong)" and which therefore doesn't work with
bootstrap/move is correctly described as dangerous. :D

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: counters + replication = awful performance?

2012-11-28 Thread Rob Coli
On Wed, Nov 28, 2012 at 7:15 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
 I may be wrong but during a bootstrap hints can be silently discarded, if
 the node they are destined for leaves the ring.

Yeah : https://issues.apache.org/jira/browse/CASSANDRA-2434

 A user like this might benefit from DANGER counters. They are not looking
 for perfection, only better performance, and the counter row keys themselves
 roll over in 5 minutes anyway.

Yep, I agree that if you don't care about accurate counting, Cassandra
counters may be for you. Cassandra counters in mongo mode are even
more web scale! The unfortunate thing is that people seem to assume
that software does what it is supposed to do, and probably do not get
a great impression of said software when it doesn't. :D

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Booting up a Datacenter replication

2012-11-23 Thread Rob Coli
On Fri, Nov 23, 2012 at 11:33 AM, Darvin  Denmian
darvin.denm...@gmail.com wrote:
 But right now I need to increase the level of data redundancy ... and
 to accomplish that I'll configure 3
 new Cassandra nodes in another Data Center.

https://issues.apache.org/jira/browse/CASSANDRA-3483

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Changing placement stratgy?

2012-11-23 Thread Rob Coli
On Fri, Nov 23, 2012 at 3:33 AM, Thomas Stets thomas.st...@gmail.com wrote:
 Is there any advantage in using a different placement strategy, considering
 that each node has all of the data anyway?

No. In your case there is no advantage to NetworkTopologyStrategy. It
is somewhat odd that you have one logical cluster in two physical
datacenters, however. That's not usually the way people do it. Of
course people don't often do RF=N either. :)

 If so, is it possible to change the placement strategy in an existing cluster?

Yes. The only practical way is to change it such that it is a NOOP. In
your case (RF=N), all changes will be a NOOP.

Once changed, however, you can use the features of the new Strategy to
your advantage. Although in your case it doesn't matter.. If you
decide to try to change your Strategy in general, be sure to design a
test that uses nodetool getendpoints to verify that replica sets stay
the same.
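
Something like this, captured before and after the change
(keyspace/CF/keys are placeholders) :

for key in key1 key2 key3; do
  nodetool -h localhost getendpoints MyKeyspace MyCF $key
done > endpoints.before
# ...make the strategy/snitch change, repeat into endpoints.after,
# then diff the two files; they should be identical.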

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Looking for a good Ruby client

2012-11-21 Thread Rob Coli
On Tue, Nov 20, 2012 at 11:40 PM, Timmy Turner timm.t...@gmail.com wrote:
 I thought you were going to expose the internals of CQL3 features like (wide
 rows with) complex keys and collections to CQL2 clients (which is something
 that should generally be possible, if Datastax' blog posts are accurate,
 i.e. an actual description of how things were implemented and not just a
 conceptual one).

https://issues.apache.org/jira/browse/CASSANDRA-4377

https://issues.apache.org/jira/browse/CASSANDRA-4924

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Upgrade 1.1.2 - 1.1.6

2012-11-20 Thread Rob Coli
On Mon, Nov 19, 2012 at 7:18 PM, Mike Heffner m...@librato.com wrote:
 We performed a 1.1.3 - 1.1.6 upgrade and found that all the logs replayed
 regardless of the drain.

Your experience and desire for different (expected) behavior is welcomed on :

https://issues.apache.org/jira/browse/CASSANDRA-4446

nodetool drain sometimes doesn't mark commitlog fully flushed

If every production operator who experiences this issue shares their
experience on this bug, perhaps the project will acknowledge and
address it.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Invalid argument

2012-11-20 Thread Rob Coli
On Tue, Nov 20, 2012 at 2:03 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:
 Thanks for the workaround, setting disk_access_mode: standard worked.

Do you have working JNA, for reference?

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Upgrade 1.1.2 - 1.1.6

2012-11-19 Thread Rob Coli
On Thu, Nov 15, 2012 at 6:21 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:
 We had an issue with counters over-counting even using the nodetool drain
 command before upgrading...

You're sure the over-count was caused by the upgrade? Counters can be
counted on (heh) to overcount. What is the scale of the over-count?

Also, do you realize that drain doesn't actually prevent over-replay,
though it should?

https://issues.apache.org/jira/browse/CASSANDRA-4446

Check for replayed operations in the system log..

grep -i replay /path/to/system.log


(I see further down thread that your logs do in fact contain replay.
Please comment to this effect, and how negative the experience is for
counts, on 4446?)

 I saw that the sudo apt-get install cassandra stops the server and restarts
 it automatically. So it updated without draining and restarted before I had
 the time to reconfigure the conf files. Is this normal ? Is there a way to
 avoid it ?

https://issues.apache.org/jira/browse/CASSANDRA-2356

Perhaps if you comment there you can help convince the debian
packager that auto-starting a distributed database when you install or
upgrade its package has negative operational characteristics.

 After both of these updates I saw my current counters increase without any
 reason.

 Did I do anything wrong ?

Expecting counters to not over-count may qualify as wrong. But your
process seems reasonable.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Query regarding SSTable timestamps and counts

2012-11-19 Thread Rob Coli
On Sun, Nov 18, 2012 at 7:57 PM, Ananth Gundabattula
agundabatt...@gmail.com wrote:
 As per the above url,  After running a major compaction, automatic minor
 compactions are no longer triggered, frequently requiring you to manually
 run major compactions on a routine basis. ( Just before the heading Tuning
 Column Family compression in the above link)

This inaccurate statement has been questioned a few times on the
mailing list. Generally what happens is people discuss it for about 10
emails and then give up because they can't really make sense of it. If
you google for cassandra-user and that sentence above, you should find
the threads. I suggest mailing d...@datastax.com, explaining your
confusion, and asking them to fix it.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: row cache re-fill very slow

2012-11-19 Thread Rob Coli
On Mon, Nov 19, 2012 at 6:17 AM, Andras Szerdahelyi 
andras.szerdahe...@ignitionone.com wrote:

 How is the saved row cache file processed? Are the cached row keys
 simply iterated over and their respective rows read from SSTables -
 possibly creating random reads with small enough sstable files, if the keys
 were not stored in a manner optimised for a quick re-fill ? -  or is there
 a smarter algorithm ( i.e. scan through one sstable at a time, filter rows
 that should be in row cache )  at work and this operation is purely disk
 i/o bound ?


Nope, that's it. I am quite confident that in the version you are running,
it just assembles the row from disk, from the relevant SSTables, via the
more or less normal read path. The more fragmented your sstables, the more
random the i/o.

These two 1.2.x-era JIRAs relate to the row cache startup penalty :

https://issues.apache.org/jira/browse/CASSANDRA-4282 # multi-threaded row
cache loading at startup
https://issues.apache.org/jira/browse/CASSANDRA-3762 # improvements to the
AutoSavingCache (which is the base class of AutoSavingRowCache)

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Question regarding the need to run nodetool repair

2012-11-15 Thread Rob Coli
On Thu, Nov 15, 2012 at 4:12 PM, Dwight Smith
dwight.sm...@genesyslab.com wrote:
 I have a 4 node cluster,  version 1.1.2, replication factor of 4, read/write
 consistency of 3, level compaction. Several questions.

Hinted Handoff is broken in your version [1] (and all versions between
1.0.0 and 1.0.3 [2]). Upgrade to 1.1.6 ASAP so that the answers below
actually apply, because working Hinted Handoff is involved.

 1)  Should nodetool repair be run regularly to assure it has completed
 before gc_grace?  If it is not run, what are the exposures?

If you do DELETE logical operations, yes. If not, no. gc_grace_seconds
only applies to tombstones, and if you do not delete you have no
tombstones. If you only DELETE in one columnfamily, that is the only
one you have to repair within gc_grace.

Exposure is zombie data, where a node missed a DELETE (and associated
tombstone) but had a previous value for that column or row and this
zombie value is resurrected and propagated by read repair.

 2)  If a node goes down, and is brought back up prior to the 1 hour
 hinted handoff expiration, should repair be run immediately?

In theory, if hinted handoff is working, no. This is a good thing
because otherwise simply restarting a node would trigger the need for
repair. In practice I would be shocked if anyone has scientifically
tested it to the degree required to be certain all edge cases are
covered, so I'm not sure I would rely on this being true. Especially
as key components of this guarantee such as Hinted Handoff can be
broken for 3-5 point releases before anyone notices.

It is because of this uncertainty that I recommend periodic repair
even in clusters that don't do DELETE.

 3)  If the hinted handoff has expired, the plan is to remove the node
 and start a fresh node in its place.  Does this approach cause problems?

Yes.

1) You've lost any data that was only ever replicated to this node.
With RF=3, this should be relatively rare, even with CL.ONE, because
writes are much more likely to succeed-but-report-they-failed than
vice versa. If you run periodic repair, you cover the case where
something gets under-replicated and then even less replicated as nodes
are replaced.
2) When you replace the node in its place (presumably using
replace_token) you will only stream the relevant data from a single
other replica. This means that, given 3 nodes A B C where datum X is
on A and B, and B fails, it might be bootstrapped using C as a source,
decreasing your replica count of X by 1.

In order to deal with these issues, you need to run a repair of the
affected node after bootstrapping/replace_tokening. Until this repair
completes, CL.ONE reads might be stale or missing. I think what
operators really want is a path by which they can bootstrap and then
repair, before returning the node to the cluster. Unfortunately there
are significant technical reasons which prevent this from being
trivial.

As such, I suggest increasing gc_grace_seconds and
max_hint_window_in_ms to reduce the amount of repair you need to run.
The negative to increasing gc_grace is that you store tombstones for
longer before purging them. The negative to increasing
max_hint_window_in_ms is that hints for a given token are stored in
one row.. and very wide rows can exhibit pathological behavior.

Also if you set max_hint_window_in_ms too high, you could cause
cascading failure as nodes fill with hints, become less performant...
thereby increasing the cluster-wide hint rate. Unless you have a very
high write rate or really lazy ops people who leave nodes down for
very long times, the cascading failure case is relatively unlikely.
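
For example, with 1.1-era knobs (values are illustrative only) :

# cassandra.yaml :
max_hint_window_in_ms: 10800000    # 3 hours instead of the default 1

# per columnfamily, via cassandra-cli :
update column family MyCF with gc_grace = 1728000;   # ~20 days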

=Rob

[1] https://issues.apache.org/jira/browse/CASSANDRA-4772
[2] https://issues.apache.org/jira/browse/CASSANDRA-3466


-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: backup/restore from sstable files ?

2012-11-12 Thread Rob Coli
On Sat, Nov 10, 2012 at 3:00 PM, Tyler Hobbs ty...@datastax.com wrote:
 For an alternative that doesn't require the same ring topology, you can use
 the bulkloader, which will take care of distributing the data to the correct
 nodes automatically.

For more details on which cases are best for the different bulk
loading techniques :

http://palominodb.com/blog/2012/09/25/bulk-loading-options-cassandra

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Single Node Cassandra Installation

2012-11-12 Thread Rob Coli
On Sat, Nov 10, 2012 at 6:16 PM, Drew Kutcharian d...@venarc.com wrote:
 Thanks Rob, this makes sense. We only have one rack at this point, so I think 
 it'd be better to start with PropertyFileSnitch to make Cassandra think that 
 these nodes each are in a different rack without having to put them on 
 different subnets. And I will have more flexibility (at the cost of keeping 
 the property file in sync) when it comes to growth.

Many people run successfully with PFS, and as you say it provides
flexibility if you get a second rack. The overhead versus a
non-rack-aware snitch is not significant.

However if you are careful you should be able to switch to PFS or
another rack aware snitch with no problems when you actually need
it... :)
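
For reference, PFS placement lives in conf/cassandra-topology.properties
and looks roughly like this (IPs, DC and rack names are made up) :

192.168.1.101=DC1:RAC1
192.168.1.102=DC1:RAC2
192.168.1.103=DC1:RAC3
default=DC1:RAC1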

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Counter column families (pending replicate on write stage tasks)

2012-11-12 Thread Rob Coli
On Mon, Nov 12, 2012 at 3:35 PM, cem cayiro...@gmail.com wrote:
 We are currently facing a performance issue with counter column families. I
 see lots of pending ReplicateOnWriteStage tasks in tpstats. Then I disabled
 replicate_on_write. It helped a lot. I want to use like that  but I am not
 sure how to use it.

Quoting Sylvain, one of the primary authors/maintainers of the
Counters support...

https://issues.apache.org/jira/browse/CASSANDRA-3868

I don't disagree about the efficiency of the valve, but at what price?
'Bootstrapping a node will make you lose increments (you don't know
which ones, you don't know how many and this even if nothing goes
wrong)' is a pretty bad drawback. That is pretty much why that option
makes me uncomfortable: it does give you better performance, so people
may be tempted to use it. Now if it was only a matter of replicating
writes only through read-repair/repair, then ok, it's pretty dangerous
but it's rather easy to explain/understand the drawback (if you don't
lose a disk, you don't lose increments, and you'd better use CL.ALL or
have read_repair_chance to 1). But the fact that it doesn't work with
bootstrap/move makes me wonder if having the option at all is not
making a disservice to users.


IOW, don't be tempted to turn off replicate_on_write. It breaks
counters. If you are under capacity at a steady state, increase
capacity.
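
For completeness, the knob under discussion is a per-CF attribute; in
cassandra-cli it looks like this (CF name is a placeholder) :

update column family MyCounterCF with replicate_on_write = true;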

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: leveled compaction and tombstoned data

2012-11-09 Thread Rob Coli
On Thu, Nov 8, 2012 at 10:12 AM, B. Todd Burruss bto...@gmail.com wrote:
 my question is would leveled compaction help to get rid of the tombstoned
 data faster than size tiered, and therefore reduce the disk space usage?

You could also...

1) run a major compaction
2) code up sstablesplit
3) profit!

This method incurs a management penalty if not automated, but is
otherwise the most efficient way to deal with tombstones and obsolete
data.. :D
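
For 1), a major compaction is just (keyspace/CF names are placeholders) :

nodetool -h localhost compact MyKeyspace MyCF

...and 2) is left as an exercise, since sstablesplit does not exist yet.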

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: backup/restore from sstable files ?

2012-11-09 Thread Rob Coli
On Thu, Nov 8, 2012 at 5:15 PM, Yang tedd...@gmail.com wrote:
 some of my colleagues seem to use this method to backup/restore a cluster,
 successfully:

 on each of the node, save entire /cassandra/data/ dir to S3,
 then on a new set of nodes, with exactly the same number of nodes,  copy
 back each of the data/ dir.

 then boot up cluster.

Yep, that works as long as the two clusters have the same tokens and
replication strategies.

 but I wonder how it worked: doesn't the system keyspace store information
 specific to the current cluster, such as my sibling nodes in the cluster, my
 IP ?? all these would change once you copy the frozen data files onto a
 new set of nodes.

Yes, for this reason you should not restore the system keyspace files
(except, optionally, Schema.). Definitely you should not restore
LocationInfo. LocationInfo contains ip-to-token mappings. Also you
should make your target cluster have a unique cluster name, and the
old cluster name is also stored in LocationInfo...

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: unsubscribe

2012-11-09 Thread Rob Coli
On Thu, Nov 8, 2012 at 4:57 PM, Jeremy McKay
jeremy.mc...@ntrepidcorp.com wrote:


http://wiki.apache.org/cassandra/FAQ#unsubscribe

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Single Node Cassandra Installation

2012-11-05 Thread Rob Coli
On Mon, Nov 5, 2012 at 12:23 PM, Drew Kutcharian d...@venarc.com wrote:
 Switching from SimpleStrategy to RackAware can be a pain.

 Can you elaborate a bit? What would be the pain point?

If you don't maintain the same replica placement vis a vis nodes on
your cluster, you have to dump and reload.

Simple example, 6 node cluster RF=3 :

SimpleSnitch : A B C D E F

Data for natural range of A is also on B and C, the next nodes in the ring.

RackAwareSnitches : A B C D E F
racks they are in : 1 1 2 2 3 3

Data for natural range of A is also on C and E, because despite not
being the next nodes in the RING, they are the first nodes in the next
rack.

If however you go from simple to rack aware and put your nodes in racks like :

A B C D E F
1 2 3 1 2 3

Then you have the same replica placement that SimpleStrategy gives you
and can safely switch strategies/snitches on an existing cluster. Data
for A is on B and C, on the same hosts, but for different reasons. Use
nodetool getendpoints to test.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: repair, compaction, and tombstone rows

2012-11-02 Thread Rob Coli
On Fri, Nov 2, 2012 at 2:46 AM, horschi hors...@gmail.com wrote:
 might I ask why repair cannot simply ignore anything that is older than
 gc-grace? (like Aaron proposed)  I agree that repair should not process any
 tombstones or anything. But in my mind it sounds reasonable to make repair
 ignore timed-out data. Because the timestamp is created on the client, there
 is no reason to repair these, right?

IIRC, tombstone timestamps are written by the server, at compaction
time. Therefore if you have RF=X, you have X different timestamps
relative to GCGraceSeconds. I believe there was another thread about
two weeks ago in which Sylvain detailed the problems with what you are
proposing, when someone else asked approximately the same question.

 I even noticed an increase when running two repairs directly after each
 other. So even when data was just repaired, there is still data being
 transferred. I assume this is due some columns timing out within that
 timeframe and the entire row being repaired.

Merkle trees are an optimization, what they trade for this
optimization is over-repair.

(FWIW, I agree that, if possible, this particular case of over-repair
would be nice to eliminate.)

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: repair, compaction, and tombstone rows

2012-11-01 Thread Rob Coli
On Thu, Nov 1, 2012 at 1:43 AM, Sylvain Lebresne sylv...@datastax.com wrote:
 on all your columns), you may want to force a compaction (using the
 JMX call forceUserDefinedCompaction()) of that sstable. The goal being
 to get rid of a maximum of outdated tombstones before running the
 repair (you could also alternatively run a major compaction prior to
 the repair, but major compactions have a lot of nasty effect so I
 wouldn't recommend that a priori).

If sstablesplit (reverse compaction) existed, major compaction would
be a simple solution to this case. You'd major compact and then split
your One Giant SSTable With No Tombstones into a number of smaller
ones. :)

https://issues.apache.org/jira/browse/CASSANDRA-4766

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Data migration between clusters

2012-10-31 Thread Rob Coli
On Tue, Oct 30, 2012 at 4:18 AM, 張 睿 chou...@cyberagent.co.jp wrote:
 Does anyone here know if there is an efficient way to migrate multiple
 cassandra clusters' data
 to a single cassandra cluster without any dataloss.

Yes.

1) create schema which is superset of all columnfamilies and all keyspaces
2) if all source clusters were the same fixed number of nodes, create
a new cluster with the same fixed number of nodes
3) nodetool drain and shut down all nodes on all participating clusters
4) copy sstables from old clusters, maintaining that data from source
node [x] ends up on target node [x]
5) start cassandra
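
For 4), one way to do it per node pair (hosts and paths are hypothetical),
run on each old node after it has been drained and stopped :

# old node [x] pushes its sstables to new node [x]; skip the system keyspace
rsync -av --exclude 'system/' /var/lib/cassandra/data/ new-node-x:/var/lib/cassandra/data/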

However without more details as to your old clusters, new clusters,
and availability requirements, I can't give you a more useful answer.

Here's some background on bulk loading, including copy-the-sstables.

http://palominodb.com/blog/2012/09/25/bulk-loading-options-cassandra

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Wrong data after rolling restart

2012-10-30 Thread Rob Coli
On Mon, May 21, 2012 at 7:08 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:
 Here are my 2 nodes' starting logs, I hope it can help...

 https://gist.github.com/2762493
 https://gist.github.com/2762495

I see in these logs that you replay 2 mutations per node, despite
doing nodetool drain before restarting. However 2 replayed mutations
per node is unlikely to corrupt a significant number of counters.

As a nodetool drain is supposed to drain the commitlog entirely, you
are encountering :

https://issues.apache.org/jira/browse/CASSANDRA-4446

I also see that you are running 1.0.7. You are unlikely to receive any
useful response from the project if you file this behavior as a bug
against 1.0.7. If you restore your backup, you might wish to upgrade
to 1.0.12 before doing so, in case this is an issue fixed in the
interim.

 I wanted to try a new config. After doing a rolling restart I have all
 my counters false, with wrong values. I stopped my servers with the
 following :
 [ snip ]
 And after restarting the second one I have lost all the consistency of
 my data. All my statistics since September are totally false now in
 production.

What does "totally false" mean? The most common inaccuracy of
Cassandra Counters is that they slightly overcount, not that they are
totally false in other ways.

Did you repair this cluster at any time?

 1 - How to fix it ? (I have a backup from this morning, but I will
 lose all the data after this date if I restore this backup)

Restoring this backup is the only way you are likely to fix this. Once
counters are corrupt/wrong you have no chance to survive make your
time. Restoring this backup may not even fix it permanently, depending
on what unknown cause is to blame.

 2 - What happened ? How to avoid it ?

Distributed counting has meaningful edge cases, and Cassandra Counters
do not cover 100% of them. As such, I recommend not using them if
accuracy is critically important.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: constant CMS GC using CPU time

2012-10-24 Thread Rob Coli
On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot btal...@aeriagames.com wrote:
 The nodes with the most data used the most memory.  All nodes are affected
 eventually not just one.  The GC was on-going even when the nodes were not
 compacting or running a heavy application load -- even when the main app was
 paused, the constant GC continued.

This sounds very much like "my heap is so consumed by (mostly) bloom
filters that I am in steady state GC thrash".

Do you have heap graphs which show a healthy sawtooth GC cycle which
then more or less flatlines?

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Java 7 support?

2012-10-16 Thread Rob Coli
On Tue, Oct 16, 2012 at 4:45 PM, Edward Sargisson
edward.sargis...@globalrelay.net wrote:
 The Datastax documentation says that Java 7 is not recommended[1]. However,
 Java 6 is due to EOL in Feb 2013 so what is the reasoning behind that
 comment?

I've asked this approximate question here a few times, with no
official response. The reason I ask is that in addition to Java 7 not
being recommended, in Java 7 OpenJDK becomes the reference JVM, and
OpenJDK is also not recommended.

From other channels, I have conjectured that the current advice on
Java 7 is it 'works' but is not as extensively tested (and definitely
not as commonly deployed) as Java 6.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Repair Failing due to bad network

2012-10-12 Thread Rob Coli
https://issues.apache.org/jira/browse/CASSANDRA-3483

Is directly on point for the use case in question, and introduces
the rebuild concept..

https://issues.apache.org/jira/browse/CASSANDRA-3487
https://issues.apache.org/jira/browse/CASSANDRA-3112

Are for improvements in repair sessions..

https://issues.apache.org/jira/browse/CASSANDRA-4767

Is for unambiguous indication of repair session status.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: cassandra 1.0.8 memory usage

2012-10-12 Thread Rob Coli
On Fri, Oct 12, 2012 at 1:26 AM, Daniel Woo daniel.y@gmail.com wrote:
What version of Cassandra? What JVM? Are JNA and Jamm working?
 cassandra 1.0.8. Sun JDK 1.7.0_05-b06, JNA memlock enabled, jamm works.

The unusual aspect here is Sun JDK 1.7. Can you use 1.6 on an affected
node and see if the problem disappears?

https://issues.apache.org/jira/browse/CASSANDRA-4571

Exists in 1.1.x (not your case) and is for leaking descriptors and not
memory, but affects both 1.6 and 1.7.

 JMAP shows that the perm gen is only 40% used.

What is the usage of the other gens?

 I have very few column families, maybe 30-50. The nodetool shows each node
 has 5 GB load.

Most of your heap being consumed by 30-50 columnfamilies' MBeans seems excessive.

 Disable swap for cassandra node
 I am gonna change swappiness to 20%

Even setting swappiness to 0% does not prevent the kernel from
swapping if swap is defined/enabled. I re-iterate my suggestion that
you de-define/disable swap on any node running Cassandra. :)
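
For example :

swapoff -a     # turn swap off immediately
# ...and remove or comment out the swap entry in /etc/fstab so it stays
# off across reboots.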

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: unnecessary tombstone's transmission during repair process

2012-10-11 Thread Rob Coli
On Thu, Oct 11, 2012 at 8:41 AM, Alexey Zotov azo...@griddynamics.com wrote:
 Value of DeletedColumn is a serialized local
 deletion time. We know that local deletion time can be different on
 different nodes for the same tombstone. So hashes of the same tombstone on
 different nodes will be different. Is it true?

Yes, this seems correct based on my understanding of the process of
writing tombstones.

 I think that local deletion time shouldn't be considered in hash's 
 calculation.

I think you are correct; the only thing that matters is whether the
tombstone exists or not. There may be something I am missing about why
the very-unlikely-to-be-identical value should be considered a merkle
tree failure.

https://issues.apache.org/jira/browse/CASSANDRA-2279

Seems related to this issue, fwiw.

 Is transmission of the equals tombstones during repair process a feature? :)
 or is it a bug?

I think it's a bug.

 If it's a bug, I'll create ticket and attach patch to it.

Yay!

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: cassandra 1.0.8 memory usage

2012-10-11 Thread Rob Coli
On Wed, Oct 10, 2012 at 11:04 PM, Daniel Woo daniel.y@gmail.com wrote:
 I am running a mini cluster with 6 nodes, recently we see very frequent
 ParNewGC on two nodes. It takes 200 - 800 ms on average, sometimes it takes
 5 seconds. You know, the ParNewGC is a stop-the-world GC and our client throws
 SocketTimeoutException every 3 minutes.

What version of Cassandra? What JVM? Are JNA and Jamm working?

 I checked the load, it seems well balanced, and the two nodes are running on
 the same hardware: 2 * 4 cores xeon with 16G RAM, we give cassandra 4G
 heap, including 800MB young generation. We did not see any swap usage during
 the GC, any idea about this?

It sounds like the two nodes that are pathological right now have
exhausted the perm gen with actual non-garbage, probably mostly the
Bloom filters and the JMX MBeans.

 Then I took a heap dump, it shows that 5 instances of JmxMBeanServer holds
 500MB memory and most of the referenced objects are JMX mbean related, it's
 kind of weird to me and looks like a memory leak.

Do you have a large number of ColumnFamilies? How large is the data
stored per node?

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: cassandra 1.0.8 memory usage

2012-10-11 Thread Rob Coli
On Thu, Oct 11, 2012 at 11:02 AM, Rob Coli rc...@palominodb.com wrote:
 On Wed, Oct 10, 2012 at 11:04 PM, Daniel Woo daniel.y@gmail.com wrote:
  We did not see any swap usage during the GC, any idea about this?

As an aside.. you shouldn't have swap enabled on a Cassandra node,
generally. As a simple example, if you have swap enabled and use the
off-heap row cache, the kernel might swap your row cache.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: 1.1.1 is repair still needed ?

2012-10-10 Thread Rob Coli
On Tue, Oct 9, 2012 at 12:56 PM, Oleg Dulin oleg.du...@gmail.com wrote:
 My understanding is that the repair has to happen within gc_grace period.
 [ snip ]
 So the question is, is this still needed ? Do we even need to run nodetool
 repair ?

If Hinted Handoff works in your version of Cassandra, and that version
is > 1.0, you should not need to repair if no node has crashed or
been down for longer than max_hint_window_in_ms. This is because after
1.0, any failed write to a remote replica results in a hint, so any
DELETE should eventually be fully replicated.

However hinted handoff is meaningfully broken between 1.1.0 and 1.1.6
(unreleased) so you cannot rely on the above heuristic for
consistency. In these versions, you have to repair (or read repair
100% of keys) once every GCGraceSeconds to prevent the possibility of
zombie data. If it were possible to repair on a per-columnfamily
basis, you could get a significant win by only repairing
columnfamilies which take DELETE traffic.

https://issues.apache.org/jira/browse/CASSANDRA-4772

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: 1.1.5 Missing Insert! Strange Problem

2012-09-27 Thread Rob Coli
On Thu, Sep 27, 2012 at 3:25 PM, Arya Goudarzi gouda...@gmail.com wrote:
 rcoli helped me investigate this issue. The mystery was that the segment of
 commit log was probably not fsynced to disk since the setting was set to
 periodic with 10 second delay and CRC32 checksum validation failed skipping
 the replay, so what happened in my scenario can be explained by this. I am
 going to change our settings to batch mode.

To be clear, I conjectured that this behavior is the cause of the
issue. As there is no logging when Cassandra encounters a corrupt log
segment [1] during replay, I was unable to verify this conjecture.

Calling nodetool drain as part of a restart process should [2]
eliminate any chance of unsynced writes being lost, and is likely to
be more performant overall than changing to batch mode.

=Rob
[1] I plan to submit a patch for this..
[2] But doesn't necessarily in 1.0.x, CASSANDRA-4446 ...

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: downgrade from 1.1.4 to 1.0.X

2012-09-27 Thread Rob Coli
On Thu, Sep 27, 2012 at 2:46 AM, Віталій Тимчишин tiv...@gmail.com wrote:
 I suppose the way is to convert all SST to json, then install previous
 version, convert back and load

Only files flushed in the new version will need to be dumped/reloaded.

Files which have not been scrubbed/upgraded (ie, have the 1.0 -h[x]-
version) get renamed to different names in 1.1. You can revert all of
these files back to 1.0 as long as you change their names back to 1.0
style names, which is presumably what your snapshots contain...
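
A sketch of the dump/reload half of that (the tools are sstable2json and
json2sstable; paths and filenames are placeholders, so double-check the
flags against your exact versions) :

# on 1.1, dump each sstable that was flushed after the upgrade :
sstable2json /path/to/new-format-Data.db > MyCF-dump.json
# after reinstalling 1.0, rebuild it as a 1.0-format sstable :
json2sstable -K MyKeyspace -c MyCF MyCF-dump.json /path/to/MyCF-hc-12-Data.db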

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: any ways to have compaction use less disk space?

2012-09-26 Thread Rob Coli
On Wed, Sep 26, 2012 at 6:05 AM, Sylvain Lebresne sylv...@datastax.com wrote:
 On Wed, Sep 26, 2012 at 2:35 AM, Rob Coli rc...@palominodb.com wrote:
 150,000 sstables seem highly unlikely to be performant. As a simple
 example of why, on the read path the bloom filter for every sstable
 must be consulted...

 Unfortunately that's a bad example since that's not true.

You learn something new every day. Thanks for the clarification.

I reduce my claim to a huge number of SSTables are unlikely to be
performant. :)

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Why data tripled in size after repair?

2012-09-26 Thread Rob Coli
On Wed, Sep 26, 2012 at 9:30 AM, Andrey Ilinykh ailin...@gmail.com wrote:
 [ repair ballooned my data size ]
 1. Why repair almost triples data size?

You didn't mention what version of cassandra you're running. In some
old versions of cassandra (prior to 1.0), repair often creates even
more extraneous data than it should by design.

However, by design, Repair repairs differing ranges based on merkle
trees. Merkle trees are an optimization, what you trade for the
optimization is over-repair. When you have multiple replicas, each
over-repairs. If you are running repair on your whole cluster, this is
why you should use repair -pr, as it reduces the per-replica
over-repair.

 2. How to compact my data back to 100G?

1) do a major compaction, one CF at a time. if you only have one CF,
you're out of luck because you don't have enough headroom.
2) then convince someone to write sstablesplit so you can turn your
100G sstable into [n] smaller sstables and/or learn to live with your
giant sstable

Or add a new data directory with more space in it, to allow you to
compact. I mention the latter in case it is trivial to attach
additional storage in your env.
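
The "add a new data directory" option is just a cassandra.yaml change
(the second path is a made-up example) :

data_file_directories:
    - /var/lib/cassandra/data
    - /mnt/extra-disk/cassandra/data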

The other alternative is to wait. Most space will be reclaimed over
time by minor compaction.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Can't change replication factor in Cassandra 1.1.2

2012-09-25 Thread Rob Coli
On Wed, Jul 18, 2012 at 10:27 AM, Douglas Muth doug.m...@gmail.com wrote:
 Even though keyspace test1 had a replication_factor of 1 to start
 with, each of the above UPDATE KEYSPACE commands caused a new UUID to
 be generated for the schema, which I assume is normal and expected.

I believe the actual issue you have is stuck schema for this
keyspace, not anything to do with replication factor. To test this,
try adding a ColumnFamily and see if it works. I bet it won't.

There are anecdotal reports in the 1.0.8-1.1.5 timeframe of this
happening. One of the causes is the one aaron pasted, but I do not
believe that is the only cause of this edge case. As far as I know,
however, there is no JIRA ticket open for stuck schema for keyspace
... perhaps you might want to look for and/or open one?

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: any ways to have compaction use less disk space?

2012-09-25 Thread Rob Coli
On Sun, Sep 23, 2012 at 12:24 PM, Aaron Turner synfina...@gmail.com wrote:
 Leveled compaction've tamed space for us. Note that you should set
 sstable_size_in_mb to reasonably high value (it is 512 for us with ~700GB
 per node) to prevent creating a lot of small files.

 512MB per sstable?  Wow, that's freaking huge.  From my conversations
 with various developers 5-10MB seems far more reasonable.   I guess it
 really depends on your usage patterns, but that seems excessive to me-
 especially as sstables are promoted.

700GB = 716,800MB; 716,800MB / 5MB = 143,360 sstables

150,000 sstables seem highly unlikely to be performant. As a simple
example of why, on the read path the bloom filter for every sstable
must be consulted...

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Is it possible to create a schema before a Cassandra node starts up ?

2012-09-24 Thread Rob Coli
On Fri, Sep 14, 2012 at 7:05 AM, Xu, Zaili z...@pershing.com wrote:
 I am pretty new to Cassandra. I have a script that needs to set up a schema
 first before starting up the cassandra node. Is this possible ? Can I create
 the schema directly on cassandra storage and then when the node starts up it
 will pick up the schema ?

Aaron gave you the scientific answer, which is that you can't load
schema without starting a node.

However if you :

1) start a node for the first time
2) load schema
3) call nodetool drain so all system keyspace CFs are guaranteed to be
flushed to sstables
4) then, from your script, start that node (or a node with identical
configuration) using the flushed system sstables (directly on the
storage)

This way you can set up a schema before starting up the cassandra node,
without having to keep a cassandra node or cluster running all the
time. This might be useful in, for example, testing contexts...
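
A rough sketch of that sequence, assuming a tarball layout and a CLI
script (schema.txt is a placeholder for whatever defines your schema):

bin/cassandra                                   # 1) start a throwaway node (wait for it to come up)
bin/cassandra-cli -h localhost -f schema.txt    # 2) load the schema
bin/nodetool -h localhost drain                 # 3) flush system keyspace CFs to sstables
# 4) stop the node, then copy the flushed system keyspace sstables
#    (the schema CFs under the system data directory) into the data
#    directory of the node your script will later start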

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Using the commit log for external synchronization

2012-09-21 Thread Rob Coli
On Fri, Sep 21, 2012 at 4:31 AM, Ben Hood 0x6e6...@gmail.com wrote:
 So if I understand you correctly, one shouldn't code against what is
 essentially an internal artefact that could be subject to change as
 the Cassandra code base evolves and furthermore may not contain the
 information an application thinks it should contain.

Pretty much.

 So in summary, given that there is no out of the box way of saying to
 Cassandra give me all mutations since timestamp X, I would either
 have to go for an event driven approach or reconsider the layout of
 the Cassandra store such that I could reconcile it in an efficient
 fashion.

With :

https://issues.apache.org/jira/browse/CASSANDRA-3690 - Streaming
CommitLog backup

You can stream your commitlog off-node as you write it. You can then
restore this commitlog and tell cassandra to replay the commit log
until a certain time by using restore_point_in_time. But...
without :

https://issues.apache.org/jira/browse/CASSANDRA-4392 - Create a tool
that will convert a commit log into a series of readable CQL
statements

You are unable to skip bad transactions, so if you want to
roll-forward but skip a TRUNCATE, you are out of luck.
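
For reference, the CASSANDRA-3690 side of this is configured in
conf/commitlog_archiving.properties; a rough sketch (paths are
placeholders, and check the comments in the stock file for the exact
timestamp format, which I believe is yyyy:MM:dd HH:mm:ss in GMT):

# archive each commitlog segment as it is finished
archive_command=/bin/ln %path /backup/commitlog/%name

# at restore time, copy segments back and stop replay at a point in time
restore_command=cp -f %from %to
restore_directories=/backup/commitlog
restore_point_in_time=2012:09:21 18:00:00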

The above gets you most of the way there, but Aaron's point about the
commitlog not reflecting whether the app met its CL remains true. The
possibility that Cassandra might coalesce to a value that the
application does not know was successfully written is one of its known
edge cases...

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: How to replace a dead *seed* node while keeping quorum

2012-09-12 Thread Rob Coli
On Tue, Sep 11, 2012 at 4:21 PM, Edward Sargisson
edward.sargis...@globalrelay.net wrote:
 If the downed node is a seed node then neither of the replace a dead node
 procedures work (-Dcassandra.replace_token and taking initial_token-1). The
 ring remains split.
 [...]
 In other words, if the host name is on the seeds list then it appears that
 the rest of the ring refuses to bootstrap it.

Close, but not exactly...

./src/java/org/apache/cassandra/service/StorageService.java line 559 of 3090

if (DatabaseDescriptor.isAutoBootstrap()
    && DatabaseDescriptor.getSeeds().contains(FBUtilities.getBroadcastAddress())
    && !SystemTable.isBootstrapped())
    logger_.info("This node will not auto bootstrap because it is configured to be a seed node.");


getSeeds asks your seed provider for a list of seeds. If you are using
the SimpleSeedProvider, this basically turns the list from seeds in
cassandra.yaml on the local node into a list of hosts.

So it isn't that the other nodes have this node in their seed list..
it's that the node you are replacing has itself in its own seed list,
and shouldn't. I understand that it can be tricky in conf management
tools to make seed nodes' seed lists not contain themselves, but I
believe it is currently necessary in this case.
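
Concretely, in cassandra.yaml on the node being replaced, something
like this (IPs are placeholders; the point is that this node's own IP
is not in the list):

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.0.0.1,10.0.0.2"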

FWIW, it's unclear to me (and Aaron Morton, whose curiosity was
apparently equally piqued and is looking into it further..) why
exactly seed nodes shouldn't bootstrap. It's possible that they only
shouldn't bootstrap without being in hibernate mode, and that the
code just hasn't been re-written post replace_token/hibernate to say
that it's ok for seed nodes to bootstrap as long as they hibernate...

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Node-tool drain on Cassandra 1.0

2012-09-12 Thread Rob Coli
On Sun, Sep 9, 2012 at 12:01 PM, Robin Verlangen ro...@us2.nl wrote:
 Deleting the commitlog files is harmless. It's just a tool that tries to
 keep Cassandra more in-sync with the other nodes. A standard repair will fix
 all problems that a commitlog replay might do too.

This is not really true... imagine an RF=2 cluster.

1) take a replica node down
2) write at CL.ONE to another replica node
3) replication to the other replica fails because it is down, and a
hint is queued locally; this means both the write and the hint exist
only in memtables, mirrored in the commitlog
4) don't nodetool flush
5) stop the node
6) delete the commitlog

You have now lost data, and repair can't fix it, because the data
you've lost has not been written to any other node. This is one of the
edge cases that makes CL.ONE pretty risky if you care about your data
and use an RF under 3.
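
If you do want to be able to throw the commitlog away, flush or drain
first; a minimal sketch (host is a placeholder):

nodetool -h 127.0.0.1 flush    # push memtables (including queued hints) to sstables
nodetool -h 127.0.0.1 drain    # or: stop accepting writes and flush everything
# only after a clean flush/drain is deleting commitlog segments reasonably safe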

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Node-tool drain on Cassandra 1.0

2012-09-07 Thread Rob Coli
On Fri, Sep 7, 2012 at 6:38 AM, Rene Kochen rene.koc...@schange.com wrote:
 If I use node-tool drain, it does stop accepting writes and flushes the
 tables. However, is it normal that the commit log files are not deleted and
 that it gets replayed?

It's not expected by design, but it does seem to be normal in
cassandra 1.0.x. I've spoken with other operators and they anecdotally
report the same behavior when doing the same operation you describe.

https://issues.apache.org/jira/browse/CASSANDRA-4446

The more people who report that they have the issue, the greater the
chance of a response or fix, so I suggest commenting "me too!" on that
ticket... :)

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: unsubscribe

2012-09-05 Thread Rob Coli
http://wiki.apache.org/cassandra/FAQ#unsubscribe

On Wed, Aug 29, 2012 at 3:57 PM, Juan Antonio Gomez Moriano 
mori...@exciteholidays.com wrote:





-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Practical node size limits

2012-09-05 Thread Rob Coli
On Sun, Jul 29, 2012 at 7:40 PM, Dustin Wenz dustinw...@ebureau.com wrote:
 We've just set up a new 7-node cluster with Cassandra 1.1.2 running under 
 OpenJDK6.

It's worth noting that the Cassandra project recommends the Sun JRE.
Without the Sun JRE, you might not be able to use JAMM to determine the
live ratio. Very few people use OpenJDK in production, so using it also
increases the likelihood that you might be the first to encounter a
given issue. FWIW!

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: adding node to cluster

2012-08-31 Thread Rob Coli
On Thu, Aug 30, 2012 at 10:39 PM, Casey Deccio ca...@deccio.net wrote:
 In what way are the lookups failing? Is there an exception?

 No exception--just failing in that the data should be there, but isn't.

At ConsistencyLevel.ONE or QUORUM?

If you are bootstrapping the node, I would expect there to be no
chance of serving blank reads like this. As auto_bootstrap is set to
true by default, I presume you are bootstrapping.

Which node are you querying to get the no data response?

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: adding node to cluster

2012-08-30 Thread Rob Coli
On Thu, Aug 30, 2012 at 10:18 AM, Casey Deccio ca...@deccio.net wrote:
 I'm adding a new node to an existing cluster that uses
 ByteOrderedPartitioner.  The documentation says that if I don't configure a
 token, then one will be automatically generated to take load from an
 existing node.
 What I'm finding is that when I add a new node, (super)
 column lookups begin failing (not sure if it was the row lookup failing or
 the supercolumn lookup failing), and I'm not sure why.

1) You almost never actually want BOP.
2) You never want Cassandra to pick a token for you. IMO and the
opinion of many others, the fact that it does this is a bug. Specify a
token with initial_token (see the example after this list).
3) You never want to use Supercolumns. The project does not support
them but currently has no plan to deprecate them. Use composite row
keys.
4) Unless your existing cluster consists of one node, you almost never
want to add only a single new node to a cluster. In general you want
to double it.
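
Regarding (2), the promised example: with RandomPartitioner the usual
recipe is token_i = i * (2**127 / N), so a 4 node cluster would get:

initial_token: 0
initial_token: 42535295865117307932921825928971026432
initial_token: 85070591730234615865843651857942052864
initial_token: 127605887595351923798765477786913079296

With BOP, initial_token is instead a key (byte) value, so spacing
tokens evenly requires knowing your key distribution, which is yet
another reason to avoid BOP.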

In summary, you are Doing It just about as Wrong as possible... but on
to your actual question ... ! :)

In what way are the lookups failing? Is there an exception?

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: one node with very high loads

2012-08-27 Thread Rob Coli
On Mon, Aug 27, 2012 at 9:25 AM, Senthilvel Rangaswamy
senthil...@gmail.com wrote:
 We are running 1.1.2 on m1.xlarge with ephemeral store for data. We are
 seeing very high loads on one of the nodes in the ring, 30+.

My first hunch would be that you are sending all client requests to
this one node, so it is coordinating 30x as many requests as it
should.

If that's not the case, if I were you I would attempt to determine if
the high i/o is high read or write on the node, via a tool like iotop.
You can also compare the tpstats of two nodes with similar uptimes to
see if your node is performing more of any stage than other members of
its cohort.

Once you determine whether it's read or write, determine which files
are being read or written.. :)

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Node forgets about most of its column families

2012-08-23 Thread Rob Coli
On Thu, Aug 23, 2012 at 11:47 AM, Edward Sargisson
edward.sargis...@globalrelay.net wrote:
 I was wondering if anybody had seen the following behaviour before and how
 we might detect it and keep the application running.

I don't know the answer to your problem, but anyone who does will want
to know in what version of Cassandra you are encountering this issue.
:)

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: nodetool repair - when is it not needed ?

2012-08-22 Thread Rob Coli
On Wed, Aug 22, 2012 at 8:37 AM, Senthilvel Rangaswamy
senthil...@gmail.com wrote:
 We are running Cassandra 1.1.2 on EC2. Our database is primarily all
 counters and we don't do any
 deletes.

 Does nodetool repair do anything for such a database. All the docs I read
 for nodetool repair suggests
 that nodetool repair is needed only if there is deletes.

Since 1.0, repair is only needed if a node crashes. If a node crashes,
my understanding is that a cluster-wide repair (with -pr on each node)
is required, because the crashed node could have lost a hint for any
other node.

https://issues.apache.org/jira/browse/CASSANDRA-2034

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Why so slow?

2012-08-22 Thread Rob Coli
On Sun, Aug 19, 2012 at 11:09 AM, Peter Morris mrpmor...@gmail.com wrote:
 Is the Windows community edition crippled for network use perhaps, or could
 the problem be something else?

It's not crippled, but it underperforms Cassandra on Linux. Cassandra
contains various Linux-specific optimizations which result in improved
performance when they can be used. I'm not sure anyone has shiny graphs
comparing the two, but I would expect Windows Cassandra to be
discernibly less performant.

That said, this is not the issue in your OP. :)

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: A few questions about Cassandra's native protocol

2012-08-22 Thread Rob Coli
On Wed, Aug 22, 2012 at 2:12 AM, Christoph Hack christ...@tux21b.org wrote:
 4. Prepared Statements

FWIW, while I suppose a client author is technically a user of
Cassandra, you appear to be making suggestions related to the
development of Cassandra. As I understand the conceptual separation
between lists, you probably want to send such mails to cassandra-dev.
:D

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: change cluster name

2012-08-09 Thread Rob Coli
On Wed, Aug 8, 2012 at 10:28 PM, rajesh.ba...@orkash.com
rajesh.ba...@orkash.com wrote:
 i would suggest you delete the files in your system keyspace folder except
 files like Schema*.*.

This thread could have been much shorter with a judicious use of grep, heh ...

user@hostname# grep -i name /etc/cassandra/cassandra.yaml
cluster_name: 'QA Cass Cluster'

user@hostname # grep 'Cass Cluster' /mnt/cassandra/data/system/*
Binary file /mnt/cassandra/data/system/LocationInfo-hd-5-Data.db matches

Just remove LocationInfo files from the system keyspace when
changing Cluster Names. Nuking the other stuff is not necessary.
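
So the whole dance is roughly this (assuming the paths above, and that
you have set initial_token in cassandra.yaml, since the saved token
also lives in LocationInfo):

nodetool -h 127.0.0.1 drain     # flush and stop accepting writes
# stop cassandra
rm /mnt/cassandra/data/system/LocationInfo*
# edit cluster_name in /etc/cassandra/cassandra.yaml
# start cassandra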

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: SSTable format

2012-07-13 Thread Rob Coli
On Fri, Jul 13, 2012 at 5:18 PM, Dave Brosius dbros...@baybroadband.net wrote:
 It depends on what partitioner you use. You should be using the
 RandomPartitioner, and if so, the rows are sorted by the hash of the row
 key. there are partitioners that sort based on the raw key value but these
 partitioners shouldn't be used as they have problems due to uneven
 partitioning of data.

The formal way this works in the code is that SSTables are ordered by
decorated row key, where decoration is only a transformation when
you are not using OrderedPartitioner. FWIW, in case you see that
DecoratedKey syntax while reading code..

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: cassandra on re-Start

2012-07-02 Thread Rob Coli
On Mon, Jul 2, 2012 at 5:43 AM, puneet loya puneetl...@gmail.com wrote:
 When I restarted the system , it is showing the keyspace does not exist.

 Not even letting me to create the keyspace with the same name again.

Paste the error you get.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Snapshot failing on JSON files in 1.1.0

2012-06-19 Thread Rob Coli
On Tue, Jun 19, 2012 at 2:55 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:
 Unable to create hard link from
 /raid0/cassandra/data/cassa_teads/stats_product-hc-233-Data.db to
 /raid0/cassandra/data/cassa_teads/snapshots/1340099026781/stats_product-hc-233-Data.db

Are you able to create this hard link via the filesystem? I am conjecturing not.

Is "snapshots" perhaps on a different mountpoint than the directory in
which you are trying to create a snapshot via hard links?
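
A quick way to check both, using the paths from your error:

mkdir -p /raid0/cassandra/data/cassa_teads/snapshots/linktest
ln /raid0/cassandra/data/cassa_teads/stats_product-hc-233-Data.db \
   /raid0/cassandra/data/cassa_teads/snapshots/linktest/stats_product-hc-233-Data.db

# hard links cannot cross filesystems, so compare the mountpoints
df /raid0/cassandra/data/cassa_teads /raid0/cassandra/data/cassa_teads/snapshots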

=Rob
PS - boy, 9 emails in the thread.. full of log output, sure don't miss
them not being bottom-quoted to every email... :)

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Snapshot failing on JSON files in 1.1.0

2012-06-19 Thread Rob Coli
On Tue, Jun 19, 2012 at 8:55 PM, Rob Coli rc...@palominodb.com wrote:
 On Tue, Jun 19, 2012 at 2:55 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:
 Unable to create hard link from
 /raid0/cassandra/data/cassa_teads/stats_product-hc-233-Data.db to
 /raid0/cassandra/data/cassa_teads/snapshots/1340099026781/stats_product-hc-233-Data.db

 Are you able to create this hard link via the filesystem? I am conjecturing 
 not.

FWIW, the errno being given by the OS and passed through Java is 1:

http://freespace.sourceforge.net/errno/linux.html

1 EPERM - Operation not permitted


=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: GCInspector works every 10 seconds!

2012-06-19 Thread Rob Coli
On Mon, Jun 18, 2012 at 12:07 AM, Jason Tang ares.t...@gmail.com wrote:
 After I enable key cache and row cache, the problem gone, I guess it because
 we have lots of data in SSTable, and it takes more time, memory and cpu to
 search the data.

The Key Cache is usually a win if added like this. The Row cache is
less likely to be. If I were you I would check your row cache hit
rates to make sure you are actually getting a win. :)
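
Depending on your version, something like:

nodetool -h 127.0.0.1 cfstats   # per-CF Key/Row cache hit rate (0.7/1.0)
nodetool -h 127.0.0.1 info      # global cache stats in 1.1+

A row cache hit rate near zero means you are spending memory for
nothing (or worse).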

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: 1.1 not removing commit log files?

2012-06-01 Thread Rob Coli
On Thu, May 31, 2012 at 7:01 PM, aaron morton aa...@thelastpickle.com wrote:
 But that talks about segments not being cleared at startup. Does not explain
 why they were allowed to get past the limit in the first place.

Perhaps the commit log size tracking for this limit does not, for some
reason, track hints? This seems like the obvious answer given the
state which appears to trigger it? This doesn't explain why the files
aren't getting deleted after the hints are delivered, of course...

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Moving to 1.1

2012-05-30 Thread Rob Coli
On Wed, May 30, 2012 at 4:08 AM, Vanger disc...@gmail.com wrote:
 3) Java 7 now recommended for use by Oracle. We have several developers
 running local cassandra instances on it for a while without problems.
 Anybody tried it in production? Some time ago java 7 wasn't recommended for
 use with cassandra, what's for now?

I have a variation of this question, which goes :

Now that OpenJDK is the official Java reference implementation, are
there plans to make Cassandra support it?

https://blogs.oracle.com/henrik/entry/moving_to_openjdk_as_the

Cassandra has (had?) a slightly passive-aggressive log message where
it refers to any JDK other than Sun's as buggy and suggests that
you should upgrade to the Sun JDK. I'm fine with using whatever JDK
is technically best, but within the enterprise using something other
than the official reference implementation can be a tough sell.
Wondering if people have a view as to the importance and/or
feasibility of making OpenJDK supported.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: commitlog_sync_batch_window_in_ms change in 0.7

2012-05-30 Thread Rob Coli
On Tue, May 29, 2012 at 10:29 PM, Pierre Chalamet pie...@chalamet.net wrote:
 You'd better use version 1.0.9 (using this one in production) or 1.0.10.

 1.1 is still a bit young to be ready for prod unfortunately.

OP described himself as experimenting which I inferred to mean
not-production. I agree with others, 1.0.x is what I'd currently
recommend for production. :)

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: commitlog_sync_batch_window_in_ms change in 0.7

2012-05-29 Thread Rob Coli
On Mon, May 28, 2012 at 6:53 AM, osishkin osishkin osish...@gmail.com wrote:
 I'm experimenting with Cassandra 0.7 for some time now.

I feel obligated to recommend that you upgrade to Cassandra 1.1.
Cassandra 0.7 is better than 0.6, but I definitely still wouldn't be
experimenting with this old version in 2012.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: nodetool repair taking forever

2012-05-25 Thread Rob Coli
On Sat, May 19, 2012 at 8:14 AM, Raj N raj.cassan...@gmail.com wrote:
 Hi experts,
 [ repair seems to be hanging forever ]

https://issues.apache.org/jira/browse/CASSANDRA-2433

Affects 0.8.4.

I also believe there is a contemporaneous bug (reported by Stu Hood?)
regarding failed repair resulting in extra disk usage, but I can't
currently find it in JIRA.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Migrating from a windows cluster to a linux cluster.

2012-05-24 Thread Rob Coli
On Thu, May 24, 2012 at 12:44 PM, Steve Neely sne...@rallydev.com wrote:
 It also seems like a dark deployment of your new cluster is a great method
 for testing the Linux-based systems before switching your mision critical
 traffic over. Monitor them for a while with real traffic and you can have
 confidence that they'll function correctly when you perform the switchover.

FWIW, I would love to see graphs which show their compared performance
under identical write load and then show the cut-over point for reads
between the two clusters. My hypothesis is that your linux cluster
will magically be much more performant/less loaded due to many
linux-specific optimizations in Cassandra, but I'd dig seeing this
illustrated in an apples to apples sense with real app traffic.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Number of keyspaces

2012-05-23 Thread Rob Coli
On Tue, May 22, 2012 at 4:56 AM, samal samalgo...@gmail.com wrote:
 Not ideally, now cass has global memtable tuning. Each cf correspond to
 memory  in ram. Year wise cf means it will be in read only state for next
 year, memtable  will still consume ram.

An empty memtable seems unlikely to consume a meaningful amount of
RAM. I'm sure by reading the code I could estimate how little memory
is involved, but I'd be surprised if it is over a few megabytes. This
is independent from the other overhead associated with a CF being
defined, of course.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: exception when cleaning up...

2012-05-22 Thread Rob Coli
On Tue, May 22, 2012 at 3:00 AM, aaron morton aa...@thelastpickle.com wrote:
 1) Isolating the node from the cluster to stop write activity. You can
 either start the node with the -Dcassandra.join_ring=false  JVM option or
 use nodetool disablethrift and disablegossip to stop writes. Note that this
 will not stop existing Hinted Handoff sessions which target the node.

As a result of the last caveat here, I recommend either restarting the
node with join_ring set to false or using iptables to firewall off
ports 7000 and 9160. If you want to be sure that you have stopped
write activity right now, nuking these ports from orbit is the only
way to be sure. disablethrift/disablegossip as currently implemented
are not sufficient for this goal.

https://issues.apache.org/jira/browse/CASSANDRA-4162
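
A rough sketch of the iptables approach (add 7001 if you use encrypted
internode traffic; remove the rules with -D when you are done):

iptables -A INPUT -p tcp --dport 7000 -j DROP   # gossip/internode, blocks HH too
iptables -A INPUT -p tcp --dport 9160 -j DROP   # thrift clients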

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Migrating a column family from one cluster to another

2012-05-18 Thread Rob Coli
On Thu, May 17, 2012 at 9:37 AM, Bryan Fernandez bfernande...@gmail.com wrote:
 What would be the recommended
 approach to migrating a few column families from a six node cluster to a
 three node cluster?

The easiest way (if you are not using counters) is :

1) make sure all filenames of sstables are unique [1]
2) copy all sstablefiles from the 6 nodes to all 3 nodes
3) run a cleanup compaction on the 3 nodes
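
A very rough sketch of (1), assuming 1.0-era names like
Users-hc-12-Data.db (Users is a placeholder CF name); run it in the
keyspace's data directory on each source node, with a different offset
per node so generation numbers never collide:

offset=10000   # e.g. 10000, 20000, ... per source node
for f in Users-hc-*; do
  gen=$(echo "$f" | cut -d- -f3)
  mv "$f" "$(echo "$f" | sed "s/-hc-$gen-/-hc-$((gen + offset))-/")"
done

Then, once everything is copied onto the 3 target nodes and they are
started, on each of them:

nodetool -h 127.0.0.1 cleanup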

=Rob
[1] https://issues.apache.org/jira/browse/CASSANDRA-1983

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Migrating a column family from one cluster to another

2012-05-18 Thread Rob Coli
On Fri, May 18, 2012 at 1:41 PM, Poziombka, Wade L
wade.l.poziom...@intel.com wrote:
 How does counters affect this?  Why would be different?

Oh, actually this is an obsolete caution as of Cassandra 0.8beta1 :

https://issues.apache.org/jira/browse/CASSANDRA-1938

Sorry! :)

=Rob
PS - for historical reference, before this ticket the counts were
based on the ip address of the nodes and things would be hosed if you
did the copy-all-the-sstables operations. it is easy for me to forget
that almost no one was using cassandra counters before 0.8, heh.

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Inconsistent dependencies

2012-05-16 Thread Rob Coli
On Tue, Apr 24, 2012 at 12:56 PM, Matthias Pfau p...@l3s.de wrote:
 we just noticed that cassandra is currently published with inconsistent
 dependencies. The inconsistencies exist between the published pom and the
 published distribution (tar.gz). I compared hashes of the libs of several
 versions and the inconsistencies are different each time. However, I have
 not found a single cassandra release without inconsistencies.

Was there ever any answer to this question or resolution to this issue?

If not, I suggest to Matthias that he file a JIRA ticket on the Apache
Cassandra JIRA.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Adding a second datacenter

2012-05-16 Thread Rob Coli
On Tue, Apr 24, 2012 at 3:24 PM, Bill Au bill.w...@gmail.com wrote:
 Everything went smoothly until I ran the last step, which is to run nodetool
 repair on all the nodes in the new data center.  Repair is hanging on all
 the new nodes.  I had to hit control-C to break out of it.
 [ snip ]
 Did I missed anything or did something wrong?  How do I recover from this?

http://wiki.apache.org/cassandra/Operations

Running nodetool repair: Like all nodetool operations in 0.7, repair
is blocking: it will wait for the repair to finish and then exit. This
may take a long time on large data sets.


Since 0.7, all nodetool operations are blocking. While repair does
in fact have bugs which make it possible that it will hang in all
extant release versions, the fact that nodetool repair (hopefully you
were using the -pr option?) takes a long time to return does not indicate
that it is hanging.

If you see repair and AES messages in system.log, it is probably not
in fact hung. If you don't see said messages for a long time, it might
be hung, in which case the only remedy currently available to you is
to restart the affected nodes.
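
For example, something like this (log location is a guess, adjust to
your install):

grep -iE 'repair|antientropy|merkle' /var/log/cassandra/system.log | tail -20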

=Rob
PS - I know this is a reply on a relatively old thread and I think you
maybe received assistance on another thread after this one. If so,
apologies!

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: getting status of long running repair

2012-05-04 Thread Rob Coli
On Fri, May 4, 2012 at 10:30 AM, Bill Au bill.w...@gmail.com wrote:
 I know repair may take a long time to run.  I am running repair on a node
 with about 15 GB of data and it is taking more than 24 hours.  Is that
 normal?  Is there any way to get status of the repair?  tpstats does show 2
 active and 2 pending AntiEntropySessions.  But netstats and compactionstats
 show no activity.

As indicated by various recent threads to this effect, many versions
of cassandra (including current 1.0.x release) contain bugs which
sometimes prevent repair from completing. The other threads suggest
that some of these bugs result in the state you are in now, where you
do not see anything that looks like appropriate activity.
Unfortunately the only solution offered on these other threads is the
one I will now offer, which is to restart the participating nodes and
re-start the repair. I am unaware of any JIRA tickets tracking these
bugs (which doesn't mean they don't exist, of course) so you might
want to file one. :)

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: JNA + Cassandra security

2012-05-01 Thread Rob Coli
On Mon, Apr 30, 2012 at 6:48 PM, Jonathan Ellis jbel...@gmail.com wrote:
 On Mon, Apr 30, 2012 at 7:49 PM, Cord MacLeod cordmacl...@gmail.com wrote:
 Hello group,

 I'm a new Cassandra and Java user so I'm still trying to get my head around 
 a few things.  If you've disabled swap on a machine what is the reason to 
 use JNA?

 Faster snapshots, giving hints to the page cache with fadvise.

If you are running in Linux, you really do want this enabled.
Otherwise, for example, compaction blows out your page cache.

(FWIW, in case it is not immediately apparent what sort of hints
Cassandra might give to the page cache with fadvise..)

https://issues.apache.org/jira/browse/CASSANDRA-1470

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Question regarding major compaction.

2012-05-01 Thread Rob Coli
On Tue, May 1, 2012 at 4:31 AM, Henrik Schröder skro...@gmail.com wrote:
 But what's the difference between doing an extra read from that One Big
 File, than doing an extra read from whatever SSTable happen to be largest in
 the course of automatic minor compaction?

The primary differences, as I understand it, are that the index
performance and bloom filter false positive rate for your One Big File
are worse. First, you are more likely to get a bloom filter false
positive due to the intrinsic degradation of bloom filter performance
as the number of keys increases. Next, after traversing the SSTable index
to get to the closest indexed key, you will be forced to scan past
more keys which are not your key in order to get to the key which is
your key.

 So I'm still confused. I don't see a significant difference between doing
 the occasional major compaction or leaving it to do automatic minor
 compactions. What am I missing? Reads will continually degrade with
 automatic minor compactions as well, won't they?

I still don't really understand what precisely "continually degrade"
means here either, FWIW, or the two operating paradigms being compared
under what sort of workloads. As a simple example, I don't believe
performance will "continually" do anything if your workload does not
issue logical UPDATEs or DELETEs to rows. The documentation statement
seems confusingly-vaguely-yet-strongly phrased, even if true.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Taking a Cluster Wide Snapshot

2012-04-26 Thread Rob Coli
 I copied all the snapshots from each individual nodes where the snapshot
 data size was around 12Gb on each node to a common folder(one folder alone).

 Strangely I found duplicate file names in multiple snapshots and
 more strangely the data size was different of each duplicate file which lead
 to the total data size to close to 13Gb(else have to be overwritten) where
 as the expectation was 12*6 = 72Gb.

You have detected via experimentation that sstable filenames are only
unique per CF per node, not globally. In order to do the operation
you are doing, you have to rename them to be globally unique. Just
inflating the integer part is the easiest way.

https://issues.apache.org/jira/browse/CASSANDRA-1983

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Taking a Cluster Wide Snapshot

2012-04-26 Thread Rob Coli
On Thu, Apr 26, 2012 at 10:38 PM, Shubham Srivastava
shubham.srivast...@makemytrip.com wrote:
 On another thought I could also try copying the data of my keyspace alone 
 from one node to another node in the new cluster (I have both the old and new 
 clusters having same nodes DC1:6,DC2:6 with same tokens) with the same tokens.

 Would there be any risk of the new cluster getting joined to the old cluster 
 probably if the data inside keyspace is aware of the original IP's etc.

As a result of this very concern while @ Digg...

https://issues.apache.org/jira/browse/CASSANDRA-769

tl;dr : as long as your cluster names are unique in your cluster
config (**and you do not copy the System keyspace, letting the new
cluster initialize with the new cluster name**), nodes are at no risk
of joining the wrong cluster.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Resident size growth

2012-04-18 Thread Rob Coli
On Tue, Apr 10, 2012 at 8:40 AM, ruslan usifov ruslan.usi...@gmail.com wrote:
 mmap doesn't depend on jna

FWIW, this confusion is a result of the use of *mlockall*, which is
used to prevent the JVM's memory from being swapped out in favor of
mmapped files, and which does depend on JNA.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Off-heap row cache and mmapped sstables

2012-04-16 Thread Rob Coli
On 4/12/12, Omid Aladini omidalad...@gmail.com wrote:
 Cassandra issues an mlockall [1] before mmap-ing sstables to prevent
 the kernel from paging out heap space in favor of memory-mapped
 sstables. I was wondering, what happens to the off-heap row cache
 (saved or unsaved)? Is it possible that the kernel pages out off-heap
 row cache in favor of resident mmap-ed sstable pages?

For what it's worth, I find this conjecture plausible given my
understanding of the Cassandra ticket which resulted in the use of
JNA+mlockall. I'd love to hear an opinion from someone from the
project with more in-depth knowledge. :)

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Will Cassandra balance load across replicas?

2012-04-05 Thread Rob Coli
On Thu, Apr 5, 2012 at 9:22 AM, zhiming shen zhiming.s...@gmail.com wrote:
 Thanks for your reply. My question is about the impact of replication on
 load balancing. Say we have nodes ABCD... in the ring. ReplicationFactor is
 3 so the data on A will also have replicas on B and C. If we are reading
 data own by A, and A is already very busy, will the requests be forwarded to
 B and C? How about update requests?

Google "cassandra dynamic snitch".
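
Short version: the dynamic snitch routes reads away from replicas that
are currently performing badly, so a busy A will tend to receive fewer
of the reads it is a replica for. Writes always go to all replicas that
own the row, so there is nothing to route around there. The knobs live
in cassandra.yaml and look roughly like this (check your version's
cassandra.yaml for the exact names and defaults):

dynamic_snitch: true
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
dynamic_snitch_badness_threshold: 0.1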

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Linux Filesystem for Cassandra

2012-04-04 Thread Rob Coli
On Wed, Apr 4, 2012 at 1:15 PM, Michael Widmann
michael.widm...@gmail.com wrote:
 If you wanna use - ZFS - use smartos / openindiana and cassandra on top
 dont work around with a FUSE FS.
 Maybe BSD (not knowing their version of zfs / zpool)

http://zfsonlinux.org/

(I can't vouch for it, but FYI this is non-FUSE ZFS for linux, seems
actively developed etc.)

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: multi region EC2

2012-04-02 Thread Rob Coli
On Mon, Mar 26, 2012 at 3:31 PM, Deno Vichas wrote:
 but what if i already have a bunch (8g per node) data that i need and i
 don't have a way to re-create it.

Note that the below may have unintended consequences if you are using
Counter column families. It actually can be done with the cluster
running; below is the least tricky version of this process.

a) stop writing to your cluster
b) do a major compaction and then stop cluster
c) ensure globally unique filenames for all sstable files for all cfs
for all nodes
d) copy all sstables to all new nodes
e) start cluster, join new nodes, run cleanup compactions

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Nodetool snapshot, consistency and replication

2012-04-02 Thread Rob Coli
On Mon, Apr 2, 2012 at 9:19 AM, R. Verlangen ro...@us2.nl wrote:
 - 3 node cluster
 - RF = 3
 - fully consistent (not measured, but let's say it is)

 Is it true that when I take a snaphot at only one of the 3 nodes this
 contains all the data in the cluster (at least 1 replica)?

Yes.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb

