Re: Facebook messaging and choice of HBase over Cassandra - what can we learn?

2010-11-21 Thread Edward Capriolo
On Sun, Nov 21, 2010 at 12:10 PM, André Fiedler
fiedler.an...@googlemail.com wrote:
 Facebook Messaging – HBase Comes of Age

 http://facility9.com/2010/11/18/facebook-messaging-hbase-comes-of-age

 2010/11/21 David Boxenhorn da...@lookin2.com

 Eventual consistency is not good enough for instant messaging.

 On Sun, Nov 21, 2010 at 6:32 PM, Simon Reavely simon.reav...@gmail.com
 wrote:

 (Posting this to both user + dev lists)

 I was reviewing the blog post on the facebook engineering blog from nov
 15th
 http://www.facebook.com/note.php?note_id=454991608919#
 http://www.facebook.com/note.php?note_id=454991608919#
 The Underlying Technology of Messages
 by Kannan Muthukkaruppan http://www.facebook.com/Kannan



 As a cassandra user I think the key sentence for this community is:
 We found Cassandra's eventual consistency model to be a difficult
 pattern
 to reconcile for our new Messages infrastructure.

 I think it would be useful to find out more about this statement from
 Kannan
 and the facebook team. Does anyone have any contacts in the Facebook
 team?

 My goal here is to understand usage patterns and whether or not the
 Cassandra community can learn from this decision; maybe even understand
 whether the Cassandra roadmap should be influenced by this decision to
 address a target user base. Of course we might also conclude that its
 just
 not a Cassandra use-case!

 Cheers,
 Simon
 --
 Simon Reavely
 simon.reav...@gmail.com







Jonathan Ellis pointed out a term that I like better: "tunable
consistency". It seems that "eventual consistency" confuses everyone,
or else it is an easy target for an anti-Cassandra public relations
campaign. If you want consistency, use:

WRITE.ALL + READ.ONE (hinted handoff off)
WRITE.QUORUM + READ.QUORUM
WRITE.ONE + READ.ALL
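To make the arithmetic behind those combinations explicit: with
replication factor N, a read is guaranteed to overlap the latest
acknowledged write whenever W + R > N. A toy check (class and method
names are made up for the example, this is not Cassandra code):

// Toy illustration of tunable consistency: a read sees the latest
// acknowledged write whenever the write and read replica counts
// overlap, i.e. W + R > N.
public class TunableConsistency {
    static boolean guaranteesConsistency(int n, int w, int r) {
        return w + r > n;
    }

    public static void main(String[] args) {
        int n = 3; // replication factor
        System.out.println("ALL(3)    + ONE(1):    " + guaranteesConsistency(n, 3, 1)); // true
        System.out.println("QUORUM(2) + QUORUM(2): " + guaranteesConsistency(n, 2, 2)); // true
        System.out.println("ONE(1)    + ALL(3):    " + guaranteesConsistency(n, 1, 3)); // true
        System.out.println("ONE(1)    + ONE(1):    " + guaranteesConsistency(n, 1, 1)); // false (eventual)
    }
}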

Also, I believe saying HBase is consistent is not true. This can happen:
write to region server -> region server acknowledges client -> write
to WAL -> region server fails = write lost

I wonder how facebook will reconcile that. :)

Not trying to be nitpicky: at Hadoop World in NYC I got to sit with
a lot of the HBase guys, and we all had a great time talking about the
mutual issues and happiness both of our communities share.

We cannot speak for Facebook, but they likely chose HBase because they
have several Hadoop core developers and a large Hadoop deployment. I
would say the decision was probably based on several things. The
current Cassandra release does not do online schema updates, and I am
sure Facebook does not want to restart 10,000 Cassandra servers for a
schema change. The current release also does not have per-column-family
memtable tuning. The upcoming Cassandra release has support for both of
these things and many, many more awesome things.

Facebook is on the high end of how much data they have to manage and
how many servers they have. Most people do not share that use case. We
can learn that Facebook chose software that was good for them based on
their use case and the experience they have in-house -- something
everyone should do.


Re: Cassandra memtable and GC

2010-11-22 Thread Edward Capriolo
On Mon, Nov 22, 2010 at 8:28 AM, Shotaro Kamio kamios...@gmail.com wrote:
 Hi Peter,

 I've tested again with recording LiveSSTableCount and MemtableDataSize
 via jmx. I guess this result supports my suspect on memtable
 performance because I cannot find Full GC this time.
 This is a result in smaller data size (160million records on
 cassandra) on different disk configuration from my previous post. But
 the general picture doesn't change.

 The attached files:
 - graph-read-throughput-diskT.png:  read throughput on my client program.
 - graph-diskT-stat-with-jmx.png: graph of cpu load, LiveSSTableCount
 and logarithm of MemtableDataSize.
 - log-gc.20101122-12:41.160M.log.gz: GC log with -XX:+PrintGC
 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

 As you can see from the second graph, logarithm of MemtableDataSize
 and cpu load has a clear correlation. When a memtable is flushed and a
 new SSTable is created (LiveSSTableCount is incremented), read
 performance will be recovered. But it degrades soon.
 I couldn't find Full GC in GC log in this test. So, I guess that this
 performance is not a result of GC activity.


 Regards,
 Shotaro


 On Sat, Nov 20, 2010 at 6:37 PM, Peter Schuller
 peter.schul...@infidyne.com wrote:
 After a memtable flush, you see minimum cpu and maximum read
 throughput both in term of disk and cassandra records read.
 As memtable increase in size, cpu goes up and read drops.
 If this is because of memtable or GC performance issue, this is the
 big question.

 As each memtable is just 128MB when flushed, I don't really expect GC
 problem or caching issues.

 A memtable is basically just a ConcurrentSkipListMap. Unless you are
 somehow triggering some kind of degenerate casein the CSLM itself,
 which seems unlikely, the only common circumstance where filling the
 memtable should be resulting in a very significant performance drop
 should be if you're running really close to heap size and causing
 additional GC asymptotally as you're growing the memtable.

 But that doesn't seem to be the case. I don't know, maybe I missed
 something in your original post, but I'm not sure what to suggest that
 I haven't already without further information/hands-on
 experimentation/observation.

 But running with verbose GC as I mentioned should at least be a good
 start (-Xloggc:path/to/gclog
 -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimestamps).

 --
 / Peter Schuller




 --
 Shotaro Kamio


As you can see from the second graph, logarithm of MemtableDataSize
and cpu load has a clear correlation.

This makes sense.

You'll see the disk read throughput periodically going down and up.
At 17:45:00 it shows zero disk reads/sec -- this must mean that your
load is being served completely from cache. If you have a very high
cache hit rate, CPU and memory are the ONLY factors. If CPU and
memtables are the only factors, then larger memtables will start to
perform slower than smaller memtables.

Possibly with SSDs the conventional thinking on larger SSTables does
not apply (at least for your active set).
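One way to convince yourself that a purely CPU/memory-bound path gets
slower as the memtable fills: the memtable is essentially a
ConcurrentSkipListMap (as Peter notes above), so each lookup costs
O(log n) comparisons and cache locality gets worse as it grows. A
rough, self-contained micro-benchmark sketch (sizes and timings are
illustrative only; give it a reasonable heap, e.g. -Xmx1g):

import java.util.concurrent.ConcurrentSkipListMap;

// Rough sketch: time random gets against skip lists of different sizes
// to watch the per-lookup cost creep up as the "memtable" grows.
// Not a rigorous benchmark (no warmup, single run).
public class SkipListGrowth {
    public static void main(String[] args) {
        for (int size : new int[] { 100_000, 1_000_000, 2_000_000 }) {
            ConcurrentSkipListMap<String, byte[]> memtable = new ConcurrentSkipListMap<>();
            for (int i = 0; i < size; i++) {
                memtable.put("row" + i, new byte[32]);
            }
            int lookups = 1_000_000;
            long start = System.nanoTime();
            for (int i = 0; i < lookups; i++) {
                memtable.get("row" + (i % size));
            }
            long nsPerGet = (System.nanoTime() - start) / lookups;
            System.out.println(size + " entries: ~" + nsPerGet + " ns per get");
        }
    }
}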


Re: cassandra vs hbase summary (was facebook messaging)

2010-11-22 Thread Edward Capriolo
On Mon, Nov 22, 2010 at 2:52 PM, Todd Lipcon t...@lipcon.org wrote:
 On Mon, Nov 22, 2010 at 10:01 AM, David Jeske dav...@gmail.com wrote:

 I havn't used either Cassandra or hbase, so please don't take any part of
 this message as me attempting to state facts about either system. However,
 I'm very familiar with data-storage design details, and I've worked
 extensively optimizing applications running on MySQL, Oracle, berkeledb
 (including distributed txn berkeleydb), and Google Bigtable.
 The recent discussion triggered by Facebook messaging using HBase helped
 surface many interesting design differences in the two systems. I'm writing
 this message both to summarize what I've read in a few different places
 about that topic, and to check my facts.
 As far as I can descern, this is a decent summary of the consistency and
 performance differences between hbase and cassandra (N3/R2/W2 or N3/R1/W3)
 for an hbase acceptable workload.. (Please correct the fact if they appear
 wrong!)
 1) Cassandra can't replicate the consistency situation of HBase. Namely,
 that when a write requiring a quorum fails it will never appear. Deriving
 from this explanation:
 [In Cassandra]Provided at least one node receives the write, it will
 eventually be written to all replicas. A failure to meet the requested
 ConsistencyLevel is just that; not a failure to write the data itself. Once
 the write is received by a node, it will eventually reach all replicas,
 there is no roll back. - Nick Telford [ref]

 [In Hbase] The DFSClient call returns when all datanodes in the pipeline
 have flushed (to the OS buffer) and ack'ed. That code comes from HDFS-200 in
 the 0.20-append branch and HDFS-265 for all branches after 0.20, meaning
 that it's in 0.21.0 - Jean-Daniel Cryans [ref]
 in HBase, if a write is accepted by only 1 of 3 HDFS replicas; and the
 region master never receives a response from the other two replicas; and it
 fails the client write, that write should never appear. Even if the region
 master then fails, when a new region master is elected, and it restarts and
 recovers, it should read HDFS blocks and accept the consensus 2/3 opinion
 that the log does not contain the write -- dropping the write. The write
 will never be seen.

 Not quite. The replica synchronization code is pretty messy, but basically
 it will take the longest replica that may have been synced, not a quorum.
 i.e the guarantee is that if you successfully sync() data, it will be
 present after replica synchronization. Unsynced data *may* be present after
 replica synchronization.
 But keep in mind that recovery is blocking in most cases - ie if the RS is
 writing to a pipeline and waiting on acks, and one of the nodes in the
 pipeline dies, then it will recover the pipeline (without the dead node) and
 continue syncing to the remaining two nodes. The client is still blocked at
 this point.
 If the RS itself dies, then it won't respond to the client at all, and it's
 anyone's guess whether the write was successful or not. The same is true if
 the network between client and RS dies. This is unavoidable in any system -
 a server can always fail *just before* sending the success message, and
 the write is left in maybe written state.
 What will *not* happen, though, is the following case:
 - Row contains value A
 - Client writes value B
 - RS fails
 - Client reads value A
 - Client reads again and sees value B
 Similarly, if client reads value B, it won't revert to value A in any
 circumstance.


 In Cassandra, if a write (requesting 2 or 3 copies) is accepted by only
 one node, that write will fail to the client. Future reads R=1 will see that
 write or not depending on whether they contact the one server that accepted
 or not, until the data is propagated, at which time they will see the write.
 Reads R=2 will not see the write until it is propagated until at least two
 servers. There is no mechanism to assure that a write is either accepted by
 the requested number of servers or aborted.
 2) Cassandra has a less efficient memory footprint data pinned in memory
 (or cached). With 3 replicas on Cassandra, each element of data pinned
 in-memory is kept in memory on 3 servers, wheras in hbase only region
 masters keep the data in memory, so there is only one-copy of each data
 element.
 3) Cassandra (N3/W2/R2) has slower reads of cached or pinned-in-memory
 data. HBase can answer a read-only query that is in memory from the single
 region-master, while Cassandra (N3/W2/R2) must read from multiple servers.
  (note, N3/W2/R2 still doesn't produce the same consistency situation as
 hbase, see #1)

 Yes, probably - except that it seems to me Cassandra should be able to offer
 lower latency in the face of java GC pauses. If an HBase RS is in a 200ms GC
 pause, latency for all rows hosted by that server will spike to 200ms. If
 one of three replicas is in a 200ms GC pause, the other two replicas will
 still respond quickly so latency should be less spiky in 


Re: cassandra vs hbase summary (was facebook messaging)

2010-11-22 Thread Edward Capriolo
On Mon, Nov 22, 2010 at 5:14 PM, Todd Lipcon t...@lipcon.org wrote:
 On Mon, Nov 22, 2010 at 1:58 PM, David Jeske dav...@gmail.com wrote:

 On Mon, Nov 22, 2010 at 11:52 AM, Todd Lipcon t...@lipcon.org wrote:

 Not quite. The replica synchronization code is pretty messy, but
 basically it will take the longest replica that may have been synced, not a
 quorum.
 i.e the guarantee is that if you successfully sync() data, it will be
 present after replica synchronization. Unsynced data *may* be present after
 replica synchronization.
 But keep in mind that recovery is blocking in most cases - ie if the RS
 is writing to a pipeline and waiting on acks, and one of the nodes in the
 pipeline dies, then it will recover the pipeline (without the dead node) and
 continue syncing to the remaining two nodes. The client is still blocked at
 this point.

 I see. So it sounds like my statement #1 was wrong. Will the RS ever
 timeout the write and fail in the face of not being able to push it to HDFS?
 Is it correct to say:
 Once a write is issued to HBase, it will either catistrophicly fail (i.e.
 disconnect), in which case the write with either have failed or succeeded,
 and if it succeeded, future reads will always show that write? As opposed to
 Cassandra, which in all configurations where reads allow a subset of all
 nodes, can fail a write while having the write show a temporary period of
 inconsistency (depending on who you talk to) followed by the write either
 applying or not applying depending on whether or not it actually wrote a
 single node during the failure to meet the write consistency request?

 Yes, this seems accurate to me.


 Does Cassandra have any return result which distinguishes between these
 two states:
 1 - your data was not written to any nodes (true failure)
 2 - your data was written to at least 1 node, but not enough to meet your
 write-consistency count
 ?






David,
Return messages such as "your data was written to at least 1 node but
not enough to meet your write-consistency count" do not help the
situation: the client that writes the data would be aware of the
inconsistency, but the other clients would not. Thus it only makes
sense to pass or fail entirely. (Though it could be an interesting
error message.)

Right, CASSANDRA-1314 only solves the memory overhead issue.

Another twist to throw into the "losing writes" conversation is that
file systems can lose writes as well :) unless you are choosing
synchronous options that most people do not use (IMHO).

@Todd: good catch about caching HFile blocks.

My point still applies though. Caching HFile blocks on a single node
vs. individual datums on N nodes may not be more efficient. Thus terms
like "slower" and "less efficient" could be very misleading.

Isn't caching only the item more efficient? In cases with high random
reads, is evicting single keys more efficient than evicting blocks in
terms of memory churn?

These are difficult questions to answer absolutely, so bullet points
such as "Cassandra has slower X" are oversimplifications of complex
problems.


Re: cassandra vs hbase summary (was facebook messaging)

2010-11-22 Thread Edward Capriolo
On Mon, Nov 22, 2010 at 5:48 PM, David Jeske dav...@gmail.com wrote:


 On Mon, Nov 22, 2010 at 2:44 PM, David Jeske dav...@gmail.com wrote:

 On Mon, Nov 22, 2010 at 2:39 PM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

 Return messages such as your data was written to at least 1 node but
 not enough to make your write-consistency count. Do not help the
 situation. As the client that writes the data would be aware of the
 inconsistency, but the other clients would not. Thus it only makes
 sense to pass or fail entirely. (Thought it could be an interesting
 error message)


 I should have thought about that before I sent it. Let me rephrase.
 Doesn't the current return message actually mean your data was written to
 between 0 and N nodes, but not enoguh to make your write-consistency count?

 I agree with you that your data was written to at least 1 node but not
 enough to make your write consistency count is not that useful. However,
 the current failure seems to merge a real failure (i.e. your data will
 never show up) with a possible failure (your data might show up)
 Personally I'd really ilke to know if my data was not written at all, and
 that has a very different meaning than my data was sort-of-written, but not
 replicated as widely as I'd like, but it someday might be, or it someday
 might not.



If you get an UnavailableException or a TimedOutException, your client
needs to retry the write. The point is that Cassandra has tunable
consistency; you can say things such as:
"I want to write to any node, so even if all replicas for the key are
down the write will get there later" -- WRITE.ANY
or
"I want to write to all nodes and get an exception if it does not
succeed on all nodes" -- WRITE.ALL
or
"I want to write to one node that is a replica" -- WRITE.ONE
or
"I want consistent data" -- READ.QUORUM + WRITE.QUORUM

Also, from your comments above, you are not taking into account that
there are two halves of the equation: read and write. If you mix the
two levels you can address many of those concerns.
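On the retry point above, a minimal client-side sketch, assuming the
0.7-era Thrift exception classes (org.apache.cassandra.thrift); the
ThriftWrite callback, attempt count and backoff are my own choices for
the example, not anything Cassandra provides:

import org.apache.cassandra.thrift.TimedOutException;
import org.apache.cassandra.thrift.UnavailableException;

// Sketch of a client-side retry wrapper. ThriftWrite stands in for
// whatever insert/batch_mutate call the application makes against
// Cassandra.Client at its chosen ConsistencyLevel.
public class WriteRetry {

    interface ThriftWrite {
        void run() throws Exception;
    }

    static void writeWithRetry(ThriftWrite write, int attempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                write.run();
                return;                            // acknowledged at the requested level
            } catch (UnavailableException | TimedOutException e) {
                if (attempt >= attempts) throw e;  // still failing, surface it to the caller
                Thread.sleep(100L * attempt);      // simple linear backoff before retrying
            }
        }
    }
}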

Cassandra is a distributed system. It is NOT just like HBase. If you
are worried about the edge cases associated with node failures,
Cassandra may not be for you. See
http://en.wikipedia.org/wiki/CAP_theorem.

However, as you pointed out in item #5, if you lose a region server you
are not going to be able to read or write that data (at all):

http://www.mail-archive.com/hbase-u...@hadoop.apache.org/msg09989.html

This poster talks about 3-4 minutes of outage. If you want consistency
like HBase's, you have to live with that kind of outage.


Re: monitoring read and write problems via log file?

2010-11-24 Thread Edward Capriolo
On Wed, Nov 24, 2010 at 3:04 AM, Peter Schuller
peter.schul...@infidyne.com wrote:
 I was told by a colleague that read and write problems in Cassandra can be
 detected by monitoring a Cassandra log file.

 What do you mean by problem? If you mean something like a hard I/O
 error or corruption causing an internal error, you should get an
 exception of some kind in the system log (typically
 /var/log/cassandra/output.log or similar, unless otherwise
 configured).

 --
 / Peter Schuller


At the default log level of INFO you should look for:
DroppedMessageLogger -- backpressure is causing failures
GCInspector -- garbage collector paused
optimal bloom filter -- not sure this is critical, but it appears at times
Large row -- message from compaction about a really large row
STATE Down -- message from gossip about node flap
STATE UP -- message from gossip about node flap
Digest mismatch exception -- quorum read fixed data (I do not see this much)

I use a log4j syslog appender to send info to our splunk/syslog
station.  I use splunk to count these events based on time buckets.
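For what it's worth, the same counting can be done straight off a
system log if you do not have splunk. A rough sketch (the log path is
an assumption -- point it at wherever your install writes its log; the
pattern strings are the ones listed above):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Rough sketch: count interesting events in a Cassandra system log.
// The default path and the pattern list are examples, not canonical values.
public class LogEventCounter {
    public static void main(String[] args) throws IOException {
        String logPath = args.length > 0 ? args[0] : "/var/log/cassandra/system.log";
        String[] patterns = { "DroppedMessageLogger", "GCInspector", "Large row",
                              "STATE Down", "STATE UP", "Digest mismatch" };
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String p : patterns) counts.put(p, 0);

        try (BufferedReader in = new BufferedReader(new FileReader(logPath))) {
            String line;
            while ((line = in.readLine()) != null) {
                for (String p : patterns) {
                    if (line.contains(p)) counts.put(p, counts.get(p) + 1);
                }
            }
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}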


Re: Capacity problem with a lot of writes?

2010-11-26 Thread Edward Capriolo
On Fri, Nov 26, 2010 at 10:49 AM, Peter Schuller
peter.schul...@infidyne.com wrote:
 Making compaction parallel isn't a priority because the problem is
 almost always the opposite: how do we spread it out over a longer
 period of time instead of sharp spikes of activity that hurt
 read/write latency.  I'd be very surprised if latency would be
 acceptable if you did have parallel compaction.  In other words, your
 real problem is you need more capacity for your workload.

 Do you expect this to be true even with the I/O situation improved
 (i.e., under conditions where the additional I/O is not a problem)? It
 seems counter-intuitive to me that single-core compaction would make a
 huge impact on latency when compaction is CPU bound on a 8+ core
 system under moderate load (even taking into account cache
 coherency/NUMA etc).

 --
 / Peter Schuller


Carlos,

I wanted to mention a specific technique I used to solve a situation I
ran into. We had a large influx of data that pushed the limits of our
current hardware; as stated above, the true answer was more hardware.
However, we ran into a situation where a single node failed several
large compactions. After failing 2 or 3 big compactions we ended up
with ~1000 SSTables for a column family.

This turned into a chicken-and-egg situation where reads were slow
because there were many SSTables and extra data like tombstones, while
compaction was brutally slow because of the read/write traffic.

My solution was to create a side-by-side install on the same box: I
used different data directories and different ports
(/var/lib/cassandra/compact, port 9168, etc.), moved the data to the
new install, and started it up. Then I ran nodetool compact on the new
instance. This instance saw no read or write traffic.

I was surprised to see the machine at 400%/1600% CPU and not much
I/O wait. Compacting 600 GB of small SSTables took about 4 days.
(However, when SSTables are larger I have compacted 400 GB in 4 hours
on the same hardware.)

Afterwards I moved the data files back in place and brought the node
back into the cluster. I have lived on both sides of the fence, where
sometimes I want long, slow compactions and sometimes breakneck-fast
ones.

I believe there is room for other compaction models. I am interested,
for example, in systems that can optimize the case with multiple data
directories. It seems from my experiment that a major compaction cannot
fully utilize the hardware in specific conditions, although knowing
which models to use where, and how to automatically select the optimal
strategy, are interesting concerns.


Re: Using mySQL to emulate Cassandra

2010-11-28 Thread Edward Capriolo
On Sun, Nov 28, 2010 at 11:35 AM, Tom Melendez t...@supertom.com wrote:
 On Sun, Nov 28, 2010 at 12:28 AM, David Boxenhorn da...@lookin2.com wrote:
 As our launch date approaches, I am getting increasingly nervous about
 Cassandra tuning. It is a mysterious black art that I haven't mastered even
 at the low usages that we have now. I know of a few more things I can do to
 improve things, but how will I know if it is enough? All this is
 particularly ironic since - as we are just starting out - we don't have
 scalability problems yet, though we hope to!

 How are your load tests looking?  Of course, there's nothing like
 going live, but I expect you'll be able to simulate 2x-3x your initial
 launch traffic.

 Luckily, I have completely wrapped Cassandra in an entity mapper, so that I
 can easily trade in something else, perhaps temporarily, until we really
 need Cassandra's scalability.

 So, I'm thinking of emulating Cassandra with mySQL. I would use mySQL either
 as a simple key-value store, without joins, or map Cassandra supercolumns to
 mySQL columns, probably of type CLOB.

 Does anyone want to talk me out of this?


 As you said, I think you just have some cold feet.

 My feeling is that you did some original research and decided on
 Cassandra for various reasons.  I think if you put the MySQL solution
 in now, you won't go back to the Cassandra solution, because once its
 live, it will be much riskier to switch.  And if you feel you made a
 mistake in your original assessment, then great, at least you found
 out before launch.

 Whatever you choose, I would flesh out my my fears with as much detail
 as possible.  Invest in load tests and develop contingency plans.  I
 talked about this in 2009 a little bit here - see slide 22, we call
 these Defcon Levels.

 http://www.slideshare.net/supertom/building-configurable-applications-for-the-web

 The idea is prioritizing what REALLY is important if the shit hits the
 fan (watch out, biz folks think everything is always important) and
 having processes to implemen and knobs to turn and levers to pull
 should you get slashdotted (or facebooked, tweeted, oprahed,
 techcrunched or whatever we call it these days).

 Good luck with your launch.

 Thanks,

 Tom


You should always worry about everything, but you should also have
confidence in your decisions. If your worry is how your cluster will
perform under load, then you should find a way to test it under load.
Tweaks and tunes do not make scalability (they help); hardware does.
If you want to be ready to be 'slashdotted' you had better have a rack
of servers idling.

If you just need a key-value store you may not need Cassandra.
Cassandra is scalable in a different way than MySQL would be.

You want convincing... (I'll try.)
Cassandra shards through node joins and handles replication. If you
start off with a MySQL master/slave architecture, or shard using
hash(key) mod 3, it is not clear how you grow that cluster with
demand.

If you make a choice that is not scalable, when you get 'slashdotted'
you will not be ready. What is worse, you will have no easy way out of
the problem.
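To make the hash(key) mod 3 point concrete, here is a tiny
illustration (key names and node counts are arbitrary) of why mod-N
sharding is hard to grow: adding one node remaps most keys, so most of
the data has to move.

// Illustrates why "node = hash(key) mod N" sharding is hard to grow:
// going from 3 to 4 nodes reassigns the majority of keys.
public class ModShardingGrowth {
    static int shard(String key, int nodes) {
        return Math.floorMod(key.hashCode(), nodes);
    }

    public static void main(String[] args) {
        int keys = 100_000, moved = 0;
        for (int i = 0; i < keys; i++) {
            String key = "user:" + i;
            if (shard(key, 3) != shard(key, 4)) moved++;
        }
        System.out.printf("%d of %d keys (%.0f%%) map to a different node after adding one node%n",
                moved, keys, 100.0 * moved / keys);
    }
}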


Re: get_count - cassandra 0.7.x predicate limit bug?

2010-11-30 Thread Edward Capriolo
On Tue, Nov 30, 2010 at 1:00 AM, Tyler Hobbs ty...@riptano.com wrote:
 What error are you getting?

 Remember, get_count() is still just about as much work for cassandra as
 getting the whole row; the only advantage is it doesn't have to send the
 whole row back to the client.

 If you're counting 3+ million columns frequently, it's time to take a look
 at counters.

 - Tyler

 On Fri, Nov 26, 2010 at 10:33 AM, Marcin mar...@33concept.com wrote:

 Hi guys,

 I have a key with 3million+ columns but when I am trying to run get_count
 on it its getting me error if setting limit more than 46000+ any ideas?

 In previous API there was no predicate at all so it was simply counting
 number of columns now its not so simple any more.

 Please let me know if that is a bug or I do something wrong.


 cheers,
 /Marcin



+1 Tyler. The problem is you can increase the client's socket timeout
as high as you like: if socketTimeout < rpcTimeout you should see
SocketTimeoutExceptions; if socketTimeout >= rpcTimeout you start
seeing Cassandra TimedOutExceptions. Raising the RPC timeout is done
on the server. In any case you may have to range_slice to get through
a row this big and count. Also, in my experience rows this large do
not work well. They are particularly dangerous when combined with the
row cache, as bringing them into memory and evicting them is both disk
and memory intensive.


Re: how to see how many rows in each node?

2010-12-03 Thread Edward Capriolo
On Fri, Dec 3, 2010 at 12:53 PM, Robert Coli rc...@digg.com wrote:
 On 12/3/10 6:09 AM, Jonathan Ellis wrote:

 Divide space used by average row size from cfstats

 On Fri, Dec 3, 2010 at 7:58 AM, Donal Zangzan...@ihep.ac.cn  wrote:

 RT.
 Is there any command or api?

 In 0.6.x :

 strings /path/to/cassandra/data/Keyspace/*-Index.db | wc -l

 =Rob




0.7 has an estimate keys function available somewhere inside JConsole.


Running multiple instances on a single server --micrandra ??

2010-12-07 Thread Edward Capriolo
I am quite ready to be stoned for this thread, but I have been thinking
about this for a while and I just wanted to bounce these ideas off some
gurus.

Cassandra does allow multiple data directories, but as far as I can
tell no one runs in this configuration. This is something that is very
different between the HBase architecture and the Cassandra
architecture. HBase borrows the concept of JBOD configurations from
Hadoop: HBase has many smallish (~256 MB) regions managed with
ZooKeeper, while Cassandra has a few (one per node) large, node-sized
token ranges managed by gossip consensus.

Let's say a node has six 300 GB disks. You have the options of RAID5,
RAID6, RAID10, or RAID0. The problem I have found with these
configurations is that major compactions (or even large minor ones) can
take a long time. Even if your disks are not heavily utilized, this is
a lot of data to move through. Thus node joins take a long time, and
node moves take a long time.

The idea behind micrandra is, for a 6-disk system, to run 6 instances
of Cassandra, one per disk, and use the RackAwareSnitch to make sure no
replicas live on the same physical node.

The downsides:
1) we would have to manage 6x the instances of Cassandra
2) we would have some overhead for each JVM

The upsides?
1) a disk/instance failure only degrades the overall performance by
1/6th (with RAID0 you lose the entire node; RAID5 still takes a hit
when down a disk)
2) moves and joins have less work to do
3) you can scale up a single node by adding a single disk to an
existing system (assuming the RAM and CPU load is light)
4) with OPP it would be easier to balance out hot spots (maybe not
this one, since I am not on OPP)

What does everyone think? Does it ever make sense to run this way?


Re: Running multiple instances on a single server --micrandra ??

2010-12-10 Thread Edward Capriolo
On Thu, Dec 9, 2010 at 10:40 PM, Bill de hÓra b...@dehora.net wrote:


 On Tue, 2010-12-07 at 21:25 -0500, Edward Capriolo wrote:

 The idea behind micrandra is for a 6 disk system run 6 instances of
 Cassandra, one per disk. Use the RackAwareSnitch to make sure no
 replicas live on the same node.

 The downsides
 1) we would have to manage 6x the instances of cassandra
 2) we would have some overhead for each JVM.

 The upsides ?
 1) Since disk/instance failure only degrades the overall performance
 1/6th (RAID0 you lost the entire node) (RAID5 still takes a hit when
 down a disk)
 2) Moves and joins have less work to do
 3) Can scale up a single node by adding a single disk to an existing
 system (assuming the ram and cpu is light)
 4) OPP would be easier to balance out hot spots (maybe not on this
 one in not an OPP)

 What does everyone thing? Does it ever make sense to run this way?

 It might for read heavy loads.

 When I looked at this, it was pointed out to me it's simpler to run fewer
 bigger coarser nodes and take the entire node/server out when something goes
 wrong. Basically give each Cassandra a server.

 I wonder if it would be better to rethink compaction if that's what's
 driving the idea. It seems to what is biting everyone, along with GC.

 Bill

Having 6 IPs on a machine would be a given in this setup. That is not
an issue for me.

It is not biting me. We all know that going from 10 to 20 nodes is
pretty simple. However, organic growth from 10 to 16, then a couple of
months later from 16 to 22, can take some effort with 300-600 GB per
node, since each join and cleanup can take a while. I am wondering if
dividing a single large node into multiple smaller instances would
make this type of growth easier.


Re: Running multiple instances on a single server --micrandra ??

2010-12-10 Thread Edward Capriolo


To clearly explain the scenario: a 5-node cluster where each node has
20% of the ring. The nodes each have 6 disks and ~200 GB of data.
Going to 10 nodes is easy: you can join one new node directly between
each pair of existing nodes.

However, going from say 5 to 8 nodes gets dicey. Do you calculate the
ideal ring positions for 10 nodes?
20% | 20% | 10% | 10% | 10% | 10% | 10% | 10%  This results in three
joins and several cleanups. With this choice you save time, but you
hope you do not get to the point where the first two nodes become
overloaded.

If you decide to work with the ideal tokens for 8 nodes, you have many
moves and joins -- at least until we have:

https://issues.apache.org/jira/browse/CASSANDRA-1418
https://issues.apache.org/jira/browse/CASSANDRA-1427

Having 6 smaller instances on a node with 6 disks would make it easier
to stay close to balanced without having to double your cluster size
each time you grow, or do a series of moves to get balanced again.
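For reference, the ideal tokens above are just the RandomPartitioner
token space (0 to 2^127) divided evenly by the node count. A small
sketch of that calculation; the output is what you would put in each
node's initial_token (or feed to nodetool move):

import java.math.BigInteger;

// Sketch of the usual "ideal token" calculation for RandomPartitioner:
// token(i) = i * 2^127 / nodeCount.
public class IdealTokens {
    public static void main(String[] args) {
        int nodeCount = args.length > 0 ? Integer.parseInt(args[0]) : 8;
        BigInteger ringSize = BigInteger.valueOf(2).pow(127);
        for (int i = 0; i < nodeCount; i++) {
            BigInteger token = ringSize.multiply(BigInteger.valueOf(i))
                                       .divide(BigInteger.valueOf(nodeCount));
            System.out.println("node " + i + ": initial_token = " + token);
        }
    }
}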


Re: N to N relationships

2010-12-12 Thread Edward Capriolo
On Sun, Dec 12, 2010 at 3:20 AM, David Boxenhorn da...@lookin2.com wrote:
 You want to store every value twice? That would be a pain to maintain, and
 possibly lead to inconsistent data.

 On Fri, Dec 10, 2010 at 3:50 AM, Nick Bailey n...@riptano.com wrote:

 I would also recommend two column families. Storing the key as NxN would
 require you to hit multiple machines to query for an entire row or column
 with RandomPartitioner. Even with OPP you would need to pick row or columns
 to order by and the other would require hitting multiple machines.  Two
 column families avoids this and avoids any problems with choosing OPP.

 On Thu, Dec 9, 2010 at 2:26 PM, Aaron Morton aa...@thelastpickle.com
 wrote:

 Am assuming you have one matrix and you know the dimensions. Also as you
 say the most important queries are to get an entire column or an entire row.
 I would consider using a standard CF for the Columns and one for the
 Rows.  The key for each would be the col / row number, each cassandra column
 name would be the id of the other dimension and the value whatever you want.

 - when storing the data update both the Column and Row CF
 - reading a whole row/col would be simply reading from the appropriate
 CF.
 - reading an intersection is a get_slice to either col or row CF using
 the column_names field to identify the other dimension.
 You would not need secondary indexes to serve these queries.
 Hope that helps.
 Aaron
 On 10 Dec, 2010,at 07:02 AM, Sébastien Druon sdr...@spotuse.com wrote:

 I mean if I have secondary indexes. Apparently they are calculated in the
 background...

 On 9 December 2010 18:33, David Boxenhorn da...@lookin2.com wrote:

 What do you mean by indexing?


 On Thu, Dec 9, 2010 at 7:30 PM, Sébastien Druon sdr...@spotuse.com
 wrote:

 Thanks a lot for the answer
 What about the indexing when adding a new element? Is it incremental?
 Thanks again


 On 9 December 2010 14:38, David Boxenhorn da...@lookin2.com wrote:

 How about a regular CF where keys are n...@n ?

 Then, getting a matrix row would be the same cost as getting a matrix
 column (N gets), and it would be very easy to add element N+1.



 On Thu, Dec 9, 2010 at 1:48 PM, Sébastien Druon sdr...@spotuse.com
 wrote:

 Hello,
 For a specific case, we are thinking about representing a N to N
 relationship with a NxN Matrix in Cassandra.
 The relations will be only between a subset of elements, so the
 Matrix will mostly contain empty elements.
 We have a set of questions concerning this:
 - what is the best way to represent this matrix? what would have the
 best performance in reading? in writing?
   . a super column family with n column families, with n columns each
   . a column family with n columns and n lines
 In the second case, we would need to extract 2 kinds of information:
 - all the relations for a line: this should be no specific problem;
 - all the relations for a column: in that case we would need an index
 for the columns, right? and then get all the lines where the value of 
 the
 column in question is not null... is it the correct way to do?
 When using indexes, say we want to add another element N+1. What
 impact in terms of time would it have on the indexation job?
 Thanks a lot for the answers,
 Best regards,
 Sébastien Druon






Before secondary indexes the only option was to store the data twice,
and yes, you have to maintain this yourself. The data model only
provides fast searches on the key. An index is normally a separate
entity with a different ordering; it is almost the same here.
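A toy illustration of storing the data twice for the matrix discussed
above, with plain in-memory maps standing in for the two column
families; the point is that the application applies every write to
both views and keeps them in step itself:

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy model of a manual "secondary index": every write is applied to
// both a row-keyed view and a column-keyed view, so either dimension
// of the matrix can be read with a single lookup.
public class MatrixBothWays {
    final Map<String, TreeMap<String, String>> byRow = new HashMap<>();
    final Map<String, TreeMap<String, String>> byColumn = new HashMap<>();

    void put(String row, String column, String value) {
        byRow.computeIfAbsent(row, k -> new TreeMap<>()).put(column, value);
        byColumn.computeIfAbsent(column, k -> new TreeMap<>()).put(row, value);
    }

    public static void main(String[] args) {
        MatrixBothWays m = new MatrixBothWays();
        m.put("row7", "col2", "x");
        m.put("row7", "col9", "y");
        m.put("row3", "col2", "z");
        System.out.println("all of row7: " + m.byRow.get("row7"));    // {col2=x, col9=y}
        System.out.println("all of col2: " + m.byColumn.get("col2")); // {row3=z, row7=x}
    }
}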


Re: unable to start cassandra-0.7r2

2010-12-13 Thread Edward Capriolo
On Mon, Dec 13, 2010 at 5:45 PM, Eric Evans eev...@rackspace.com wrote:
 On Mon, 2010-12-13 at 17:27 -0500, Liangzhao Zeng wrote:
 I can run the 0.66 using same logging setup without any problem. Not
 sure what's the difference when starting up the 0.7 in eclipse. Can
 someone share the logging setup?

 Make sure that you have -Dlog4j.configuration=log4j-server.properties
 among your VM arguments and that conf/ (assuming that's where you have
 it) has been added to the classpath.  Since you say this worked with
 0.6.6 and doesn't with 0.7, I'm guessing the latter is already in place
 and the former is the problem.

 --
 Eric Evans
 eev...@rackspace.com



I am not sure about the logging, but cassandra.config should now be a
URI to your cassandra.yaml, not your storage-dir. Mine looks like this:

-Dcom.sun.management.jmxremote.port=
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcassandra-foreground
-Dcassandra.config=file:///home/edward/idea/conf/cassandra.yaml -ea
-Xmx1G


Re: Read Latency Degradation

2010-12-17 Thread Edward Capriolo
On Fri, Dec 17, 2010 at 8:21 AM, Wayne wav...@gmail.com wrote:
 We have been testing Cassandra for 6+ months and now have 10TB in 10 nodes
 with rf=3. It is 100% real data generated by real code in an almost
 production level mode. We have gotten past all our stability issues,
 java/cmf issues, etc. etc. now to find the one thing we assumed may not be
 true. Our current production environment is mysql with extensive
 partitioning. We have mysql tables with 3-4 billion records and our query
 performance is the same as with 1 million records ( 100ms).

 For those of us really trying to manage large volumes of data memory is not
 an option in any stretch of the imagination. Our current data volume once
 placed within Cassandra ignoring growth should be around 50 TB. We run
 manual compaction once a week (absolutely required to keep ss table counts
 down) and it is taking a very long amount of time. Now that our nodes are
 past 1TB I am worried it will take more than a day. I was hoping everyone
 would respond to my posting with something must be wrong, but instead I am
 hearing you are off the charts good luck and be patient. Scary to say the
 least given our current investment in Cassandra. Is it true/expected that
 read latency will get worse in a linear fashion as the ss table size grows?

 Can anyone talk me off the fence here? We have 9 MySQL servers that now
 serve up 15+TB of data. Based on what we have seen we need 100 Cassandra
 nodes with rf=3 to give us good read latency (by keeping the node data sizes
 down). The cost/value equation just does not add up.

 Thanks in advance for any advice/experience you can provide.


 On Fri, Dec 17, 2010 at 5:07 AM, Daniel Doubleday daniel.double...@gmx.net
 wrote:

 On Dec 16, 2010, at 11:35 PM, Wayne wrote:

  I have read that read latency goes up with the total data size, but to
  what degree should we expect a degradation in performance? What is the
  normal read latency range if there is such a thing for a small slice of
  scol/cols? Can we really put 2TB of data on a node and get good read 
  latency
  querying data off of a handful of CFs? Any experience or explanations would
  be greatly appreciated.

 If you really mean 2TB per node I strongly advise you to perform thorough
 testing with real world column sizes and the read write load you expect. Try
 to load test at least with a test cluster / data that represents one
 replication group. I.e. RF=3 - 3 nodes. And test with the consistency level
 you want to use. Also test ring operations (repair, adding nodes, moving
 nodes) while under expected load/

 Combined with 'a handful of CFs' I would assume that you are expecting a
 considerable write load. You will get massive compaction load and with that
 data size the file system cache will suffer big time. You'll need loads of
 RAM and still ...

 I can only speak about 0.6 but ring management operations will become a
 nightmare and you will have very long running repairs.

 The cluster behavior changes massively with different access patterns
 (cold vs warm data) and data sizes. So you have to understand yours and test
 it. I think most generic load tests are mainly marketing instruments and I
 believe this is especially true for cassandra.

 Don't want to sound negative (I am a believer and don't regret our
 investment) but cassandra is no silver bullet. You really need to know what
 you are doing.

 Cheers,
 Daniel


Yes, major compactions for large sets of data do take a long time
(360 GB takes me about 6 hours).

You said you need to compact to keep the SSTable count low. This is
not a good sign. My SSTable counts sawtooth between 8-15 per CF
through the day. If you are in a scenario where the SSTables are
growing all day and only catch up at night, and you have tuned
memtables, then you likely need more nodes. It means that your cluster
cannot really keep up with your write traffic. Cassandra can take
bursts of writes well, but if your SSTable count keeps getting higher
you are essentially falling behind. (You may not need 100 nodes like
you are suggesting, but possibly a few to get you over the fence.)

I do run major compactions at night, but not every night on every
node. I do one node a night to make sure these are splayed out over
the week. With deletes happening on non-major compactions you may not
need to do this, but we add and remove a lot of data per day, so I
find I have to (or at least should), since the nights are quiet for us
anyway.
As for how many nodes you need... what works out better?
Big iron: 1x (2 TB, 64 GB RAM) -- cost? power? rack size?
Small form factor: 4x (500 GB, 16 GB RAM) -- cost? power? rack size?
Generally I think most are running the small-form-factor type of
deployment, and generally this works better by avoiding 2 TB
compactions!

Is it true that read latency grows linearly with sstable size? No (but
it could be true in your case).

As for your specific problems, more info is needed.

How many nodes?
How much ram 

Re: Cassandra Monitoring

2010-12-17 Thread Edward Capriolo
On Fri, Dec 17, 2010 at 5:48 AM, Daniel Doubleday
daniel.double...@gmx.net wrote:
 Hi all
 just wanted to share a simple way we use to monitor cassandra internals with
 zabbix.
 We use a minimal http server which reads jmx and shows returns them in a
 property form. Thats read by zabbix every 30secs.
 That's started together with cassandra:
 https://gist.github.com/744761
 Output looks something like:
 d...@caladan[~]$ curl http://b22:9090/jmxexport
 OperationMode=Normal
 Load=151.379
 ReadOperations=506334
 WriteOperations=865867
 TotalReadLatencyMicros=6663882635
 TotalWriteLatencyMicros=352292885
 BytesCompacted=0
 BytesTotalInProgress=0
 PendingTasks=0
 HeapUsed=1153810280
 How / what are you monitoring? Best practices someone?
 Cheers,
 Daniel Doubleday,
 smeet.com, Berlin

Using Cacti and http://www.jointhegrid.com/cassandra/cassandra-cacti-m6.jsp
Many people are using Munin; there is good support there.

Best practices:
Monitor SSTable sizes and growth.
Monitor reads/writes per second.
Monitor cache hit rate.
Monitor compactions (what % of the day an average node spends compacting).
Monitor SSTable count (make sure you do not have too many).
Monitor I/O wait (make sure you are not disk bound).
Monitor JVM memory (make sure you have some overhead for bursts of traffic).
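Most of those numbers come straight from JMX, so anything that can
read an MBean attribute can feed Cacti/Munin/Zabbix. A minimal generic
reader sketch -- the host:port, MBean name and attribute passed on the
command line are placeholders, since the exact Cassandra MBean names
vary by version; check them in JConsole first:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Minimal JMX attribute reader: connect to a node and print one MBean
// attribute. Usage: java JmxRead host:port 'some.mbean:type=...' Attribute
public class JmxRead {
    public static void main(String[] args) throws Exception {
        String hostPort = args[0];   // e.g. "cass01:8080" -- whatever JMX port your install uses
        String mbeanName = args[1];  // an MBean name as shown in JConsole
        String attribute = args[2];  // an attribute of that MBean
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + hostPort + "/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            Object value = mbs.getAttribute(new ObjectName(mbeanName), attribute);
            System.out.println(attribute + "=" + value);
        } finally {
            connector.close();
        }
    }
}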


Re: Read Latency Degradation

2010-12-17 Thread Edward Capriolo
On Fri, Dec 17, 2010 at 12:26 PM, Daniel Doubleday
daniel.double...@gmx.net wrote:
 How much ram is dedicated to cassandra? 12gb heap (probably too high?)
 What is the hit rate of caches? high, 90%+

 If your heap allows it I would definitely try to give more ram for fs cache.
 Your not using row cache so I don't see what cassandra would gain from so
 much memory.
 A question about your tests:
 I assume that they run isolated (you load test one cf at a time) and the
 results are the same byte-wise?
 So the only difference is that one time you are reading from a larger file?
 Do you see the same IO load in both tests? Do you use mem-mapped io? And if
 so are the number of page faults the same in both tests?
 In the end it could just be more physical movements of the disc heads with
 larger files ...

 On Dec 17, 2010, at 5:46 PM, Wayne wrote:

 Below are some answers to your questions. We have wide rows (what we like
 about Cassandra) and I wonder if that plays into this? We have been loading
 1 keyspace in our cluster heavily in the last week so it is behind in
 compaction for that keyspace. I am not even looking at those read latency
 times as there are as many as 100+ sstables. Compaction will run tomorrow
 for all nodes (weekend is our slow time) and I will test the read latency
 there. For the keyspace/CFs that are already well compacted we are seeing a
 steady increase in read latency as the total sstable size grows and a linear
 relationship between our different keyspaces cfs sizes and the read latency
 for reads.

 How many nodes? 10 - 16 cores each (2 x quad ht cpus)
 How much ram per node? 24gb
 What disks and how many? SATA 7200rpm 1x1tb for commit log, 4x1tb (raid0)
 for data
 Is your ring balanced? yes, random partitioned very evenly
 How many column families? 4 CFs x 3 Keyspaces
 How much ram is dedicated to cassandra? 12gb heap (probably too high?)
 What type of caching are you using? Key caching
 What are the sizes of caches? 500k-1m values for 2 of the CFs
 What is the hit rate of caches? high, 90%+
 What does your disk utiliztion|CPU|Memory look like at peak times? Disk goes
 to 90%+ under heavy read load. CPU load high as well. Latency does not
 change that much for single reads vs. under load (30 threads). We can keep
 current read latency up to 25-30 read threads if no writes or compaction is
 going on. We are worried about what we see in terms of latency for a single
 read.
 What are your average mean|max row size from cfstats? 30k avg/5meg max for
 one CF and 311k avg/855k max for the other.
 On average for a given sstable how large is the data bloom and index files?
 30gig data, 189k filter, 5.7meg index for one CF, 98gig data, 587k filter,
 18meg index for the other.

 Thanks.




Re: Which Java on Fedora? Sun's or GNU's?

2010-12-29 Thread Edward Capriolo
On Wed, Dec 29, 2010 at 11:29 AM, Eric Evans eev...@rackspace.com wrote:
 On Wed, 2010-12-29 at 10:56 -0500, Edward Capriolo wrote:
 Cassandra pushes your JVM hard. Do not count on your distro which
 might provide versions of things that are 3 months to 2 years old.

 Come on.  If it worked fine 3 months ago, then chances are it will
 continue to.  This is one of the reasons that people choose
 (environmentally )stable distro releases (which are often supported for
 much longer than 2 years).

 Chosing what your distro gives your prepare to be disappointed and
 have to upgrade as soon as you get some respectable load.  If you are
 using sun/oracle (That still feels strange to say JVM oracle) you want
 something much higher then just 1.6.0. Go for the latest and greatest
 1.6.21 or higher JRE/JDK 1.6.23.

 FWIW, the wiki says: For Sun's jvm, this means at least u19; u21 is
 better.

 I install the JDK (not the JRE) because its a super set and hey I just
 might feel like compiling something.

 Other not so great options... rpm -Uvh --force --skip-deps (If you
 know you have a Java that your RPM manager does not know about)

 No.  If this is really the situation, then it's disingenuous to offer
 the package at all, and it should be dropped.   I don't think these
 command line arguments should ever appear on a public mailing list.

 Get source RPM strip out the Java dependency (If you know you have a
 Java that your RPM manager does not know about)
 Create a source RPM with nothing in it that PROVIDES JAVA (If you
 know you have a Java that your RPM manager does not know about)

 --
 Eric Evans
 eev...@rackspace.com


If it worked fine three months ago and you came into Cassandra IRC
with a random JVM problem, the first thing someone would tell you to do
is probably to update to the latest JVM :)

Some distros go for perceived stability over bug/performance
enhancements in their package choices. For example, (a major unnamed
Linux distribution) still ships MySQL 5.0 rather than 5.1, or a
BerkeleyDB that NEVER gets upgraded. Why? Tracking these packages and
all the downstream changes from code that links to MySQL or BDB would
result in way too much churn, and that would make them look less stable
and enterprise-like.

Another major distribution allows anyone to submit a package, as a
result they end up with hundreds/thousands of packages that NEVER get
updated or supported in any meaningful way.

As for Cassandra there are two key components Java and Cassandra. If
you are just taking whatever the distro gives you for these things,
you should probably do more research.

As to not letting the cat out of the bag on what you can do with RPM:
I agree, half-heartedly. RPM is a glorified tar, and when it begins
insisting you need 40 dependent libraries you do not really need
(which is very common, especially in the RPM Java world) because some
applet buried in an example somewhere just might need X11... Well, I am
more likely to edit the source RPM and make myself happy than let RPM
install all of GNOME just so the RPM is happy.

In this case OpenJDK or Sun should meet the java >= 1.6.0 requirement.

Edward


Re: The size of the data, I must be doing smth wrong....

2011-01-05 Thread Edward Capriolo
On Wed, Jan 5, 2011 at 9:52 AM, Jonathan Ellis jbel...@gmail.com wrote:
 It's normal for Cassandra to use more disk space than MySQL.  It's
 part of what we trade for not having to rewrite every row when you add
 a new column.

 SSTables that are obsoleted by a compaction are deleted
 asynchronously when the JVM performs a GC.
 http://wiki.apache.org/cassandra/MemtableSSTable

 On Wed, Jan 5, 2011 at 8:35 AM, nicolas lattuada
 nicolaslattu...@hotmail.fr wrote:
 Hi

 i have some data size issues:

 i am storing super columns with the following content:

 {a=1, b=2, c=3...n=14}

 i am storing it 300 000 times and i have a data size on the disk about 283Mo

 And in other side i have a mysql table which stores a bunch of data the
 schema follows:
 6 varchars +100
 5 ints +6

 I put about 1 300 000 records on it and end up with 150Mo of data and 57Mo
 of index.

 Then i think i am certainly doing something wrong...

 The other thing is when i run flush and then compact the size of my data
 increases, then i imagine something is copied up on compaction
 So is there a way to remove the unused data? (cleanup doesn t seem to do the
 job).

 Any help to reduce the size of the data would be greatly apreciated!
 Greetings





 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com


Unlike datastores that are delimited or have fixed column sizes,
Cassandra stores each row as a sorted map of columns. A column is a
tuple of {column name, column value, timestamp}. Also, the data is not
stored as tersely as it is inside MySQL.
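
As a rough back-of-envelope (assuming the commonly cited ~15 bytes of fixed
per-column overhead for the name length, flags, timestamp and value length,
plus the name and value bytes themselves): 300,000 super columns x 14
subcolumns is about 4.2 million columns, and at roughly 20 bytes each that
is already ~84 MB before counting super column headers, row keys, per-row
indexes and bloom filters, or the duplicate copies sitting in SSTables that
have not been compacted away yet. A few hundred MB on disk for this data
set is therefore not surprising.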


Re: Is this a good schema design to implement a social application..

2011-01-08 Thread Edward Capriolo
On Fri, Jan 7, 2011 at 11:38 PM, Rajkumar Gupta rajkumar@gmail.com wrote:
 In the twissandra example,
 http://www.riptano.com/docs/0.6/data_model/twissandra#adding-friends ,
 I find that they have split the materialized view of a user's homepage
 (like his followers list, tweets from friends) into several
 columnfamilies instead of putting in supercolumns inside a single
 SupercolumnFamily, thereby making the rows skinnier. I was wondering as
 to which one will give better performance in terms of reads.
 I think skinnier will definitely have the advantage of less row
 mutations thus good read performance, when, only they, need to be
 retrieved, plus supercolumns of followerlist ,etc are avoided(this
 sounds good as supercolumn indexing limitations will not suck), but I
 am still not sure whether it would be beneficial in terms of
 performance numbers, if I split the materialized view of a single user
 into several columnfamilies instead of a single row in a single
 Supercolumnfamily.





 On Sat, Jan 8, 2011 at 2:05 AM, Rajkumar Gupta rajkumar@gmail.com wrote:
 The fact that subcolumns inside the supercolumns aren't indexed
 currently may suck here, whenever a small no (10-20 ) of subcolumns
 need to be retreived from a large list of subcolumns of a supercolumn
 like MyPostsIdKeysList.

 On Fri, Jan 7, 2011 at 9:58 PM, Raj rajkumar@gmail.com wrote:
 My question is in context of a social network schema design

 I am thinking of following schema for storing a user's data that is
 required as he logs in  is led to his homepage:-
 (I aimed at a schema design such that through a single row read query
 all the data that would be required to put up the homepage of that
 user, is retreived.)

 UserSuperColumnFamily: {    // Column Family

 UserIDKey:
 {columns:            MyName, MyEmail, MyCity,...etc
  supercolumns:    MyFollowersList, MyFollowiesList, MyPostsIdKeysList,
 MyInterestsList, MyAlbumsIdKeysList, MyVideoIdKeysList,
 RecentNotificationsForUserList,  MessagesReceivedList,
 MessagesSentList, AccountSettingsList, RecentSelfActivityList,
 UpdatesFromFollowiesList
 }
 }

 Thus user's newfeed would be generated using superColumn:
 UpdatesFromFollowiesList. But the UpdatesFromFollowiesList, would
 obviously contain only Id of the posts and not the entire post data.

 Questions:

 1.) What could be the problems with this design, any improvements ?

 2.) Would frequent  heavy overwrite operations/ row mutations (for
 example; when propagating the post updates for news-feed from some
 user to all his followies) which leads to rows ultimately being in
 several SSTables, will lead to degraded read performance ?? Is it
 suitable to use row Cache(too big row but all data required uptil user
 is logged in) If I do not use cache, it may be very expensive to pull
 the row each time a data is required for the given user since row
 would be in several sstables. How can I improve the
 read performance here

 The actual data of the posts from network would be retrieved using
 PostIdKey through subsequent read queries from columnFamily
 PostsSuperColumnFamily which would be like follows:

 PostsSuperColumnFamily:{

 PostIdKey:
 {
 columns:            PostOwnerId, PostBody
 supercolumns:   TagsForPost {list of columns of all tags for the
 post}, PeopleWhoLikedThisPost {list of columns of UserIdKey of all the
 likers}
 }
 }

 Is this the best design to go with or are there any issues to consider
 here ? Thanks in anticipation of your valuable comments.!




From your description, UserSuperColumnFamily seems to be both a
Standard Column Family and a Super Column Family. You can not do that.
However, you can encode things such as MyName, MyCity, and MyState into
a 'UserInfo' super column: UserInfo:MyState...

(As you mentioned) super columns are not indexed and have to be
completely de-serialized for each access. Because of this they are not
widely used for anything but small keys with a few columns. This also
applies to mutations: the row can exist in multiple SSTables
until it finally gets compacted. That can result in much more storage
used for an object that changes often.

Most designs use composite keys or something like JSON-encoded values
with Standard Column Families to achieve something like a Super
Column.

(SuperColumns are not always as Super as they seem :)
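
As a rough illustration of that composite-name approach (illustrative names
only, not a recommended schema), the same homepage data could live in a
standard column family with prefixed column names:

UserCF: {    // Standard Column Family
  UserIDKey:
  {columns:   UserInfo:MyName, UserInfo:MyCity, UserInfo:MyState,
              Followers:<followerUserIdKey>, Posts:<postIdKey>,
              UpdatesFromFollowies:<timeUUID> = <postIdKey>, ...
  }
}

With an ordered comparator you can then slice on a column-name prefix (for
example start=Followers: and finish=Followers:~) instead of deserializing a
whole super column.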


Re: Welcome committer Jake Luciani

2011-01-13 Thread Edward Capriolo
Three cheers!

On Thu, Jan 13, 2011 at 1:45 PM, Jake Luciani jak...@gmail.com wrote:
 Thanks Jonathan and Cassandra PMC!
 Happy to help Cassandra take over the world!
 -Jake

 On Thu, Jan 13, 2011 at 1:41 PM, Jonathan Ellis jbel...@gmail.com wrote:

 The Cassandra PMC has voted to add Jake as a committer.  (Jake is also
 a committer on Thrift.)

 Welcome, Jake, and thanks for the hard work!

 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com




Re: cassandra row cache

2011-01-13 Thread Edward Capriolo
Is it possible that you are reading at READ.ONE, and that READ.ONE
only warms the cache on 1 of your three nodes (20%)? The 2nd read warms
another 60%, and by the third read all the replicas are warm, 99%?

This would be true if digest reads were not warming caches.

Edward

On Thu, Jan 13, 2011 at 4:07 PM, Saket Joshi sjo...@touchcommerce.com wrote:
 The cache is 800,000 per node , I have 15 nodes in the cluster. I see the 
 cache value increased after the first run, the row cache hit rate was 0 for 
 first run. For second run of the same data , the hit rate increased to 30% 
 but on the third it jumps to 99%


 -Saket

 -Original Message-
 From: Chris Burroughs [mailto:chris.burrou...@gmail.com]
 Sent: Thursday, January 13, 2011 1:03 PM
 To: user@cassandra.apache.org
 Cc: Saket Joshi
 Subject: Re: cassandra row cache

 On 01/13/2011 02:05 PM, Saket Joshi wrote:
 Yes it does change.


 So the confusing part for me is why a cache of size 80,000 would not be
 fill after 1,600,000 requests.  Can you observe items cached and hit
 rate while making the first 1.6 million row query?



Re: about the data directory

2011-01-13 Thread Edward Capriolo
On Thu, Jan 13, 2011 at 7:56 PM, raoyixuan (Shandy)
raoyix...@huawei.com wrote:
 I am a bit confused: why can users read the data from all nodes? I mean,
 the data is only kept on the replicas, so how is that achieved?

 -Original Message-
 From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller
 Sent: Friday, January 14, 2011 1:19 AM
 To: user@cassandra.apache.org
 Subject: Re: about the data directory

 So you mean just the replica node 's sstable will be changed ,right?

 The data will only be written to the nodes that are part of the
 replica set fo the row (with the exception of hinted handoff, but
 that's a different sstable).

 If all the replica node broke down, whether the users can read the data?

 If *all* nodes in the replica set for a particular row are down, then
 you won't be able to read that row, no.

 --
 / Peter Schuller


It does not matter which node you connect to. The node you connect to
hashes the key (or uses the key itself when using the Order Preserving
Partitioner) to determine which node or nodes the data should be on. If
the key is on that node it returns it directly to the client. If the
key is not on that node, Cassandra fetches it from another node and
then returns that data. The client is unaware and does not need to be
concerned with where the data came from.
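
For illustration only, here is a highly simplified sketch in Java of that
routing idea for a RandomPartitioner-style ring (this is not Cassandra's
internal code; the token values and addresses below are made up):

import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class RingLookup {
    // RandomPartitioner-style token: MD5 of the key as a positive BigInteger.
    static BigInteger token(String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
        return new BigInteger(digest).abs();
    }

    // The row belongs to the first node whose token is >= the row token,
    // wrapping around to the lowest token if necessary.
    static String primaryReplica(SortedMap<BigInteger, String> ring, String key) throws Exception {
        SortedMap<BigInteger, String> tail = ring.tailMap(token(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    public static void main(String[] args) throws Exception {
        SortedMap<BigInteger, String> ring = new TreeMap<BigInteger, String>();
        ring.put(BigInteger.ZERO, "10.0.0.1");  // made-up node tokens/addresses
        ring.put(new BigInteger("85070591730234615865843651857942052864"), "10.0.0.2");
        System.out.println(primaryReplica(ring, "some-row-key"));
    }
}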


Re: live data migration from mysql to cassandra

2011-01-14 Thread Edward Capriolo
On Fri, Jan 14, 2011 at 10:40 AM, ruslan usifov ruslan.usi...@gmail.com wrote:
 Hello

 Dear community please share your experience, home you make live(without
 stop) migration from mysql or other RDBM to cassandra


There is no built-in way to do this. I remember hearing at Hadoop
World this year that the HBase guys have a system to read MySQL slave
logs and replay them into HBase. Since the rest of the NoSQL community
seems to do this, maybe we can 'borrow' this idea.

Edward


Re: Cassandra in less than 1G of memory?

2011-01-14 Thread Edward Capriolo
On Fri, Jan 14, 2011 at 2:13 PM, Victor Kabdebon
victor.kabde...@gmail.com wrote:
 Dear rajat,

 Yes it is possible, I have the same constraints. However I must warn you,
 from what I see Cassandra memory consumption is not bounded in 0.6.X on
 debian 64 Bit

 Here is an example of an instance launch in a node :

 root 19093  0.1 28.3 1210696 570052 ?  Sl   Jan11   9:08
 /usr/bin/java -ea -Xms128M -Xmx512M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
 -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
 -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081
 -Dcom.sun.management.jmxremote.ssl=false
 -Dcom.sun.management.jmxremote.authenticate=false
 -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp
 bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
 org.apache.cassandra.thrift.CassandraDaemon

 Look at the second bold value, Xmx indicates the maximum memory that
 cassandra can use; it is set to 512, so it could easily fit into 1 GB.
 Now look at the first one, 570 MB > 512 MB. Moreover if I come back in one
 day the first value will be even higher. Probably around 610 MB. Actually it
 increases to the point where I need to restart it, otherwise other programs
 are shut down by Linux for cassandra to further expand its memory usage...

 By the way it's a call to other cassandra users, am I the only one to
 encounter this problem ?

 Best regards,

 Victor K.

 2011/1/14 Rajat Chopra rcho...@makara.com

 Hello.



 According to  JVM heap size topic at
 http://wiki.apache.org/cassandra/MemtableThresholds , Cassandra would need
 atleast 1G of memory to run. Is it possible to have a running Cassandra
 cluster with machines that have less than that memory… say 512M?

 I can live with slow transactions, no compactions etc, but do not want an
 OutOfMemory error. The reason for a smaller bound for Cassandra is that I
 want to leave room for other processes to run.



 Please help with specific parameters to tune.



 Thanks,

 Rajat




-Xmx512M is not an overall memory limit. MMAP'ed files also consume
memory. Try setting the disk access mode to standard instead of mmap or
mmap_index_only.
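For example, in 0.6 that is <DiskAccessMode>standard</DiskAccessMode> in
storage-conf.xml; in 0.7 it is disk_access_mode: standard in cassandra.yaml.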


Re: balancing load

2011-01-16 Thread Edward Capriolo
On Sun, Jan 16, 2011 at 11:45 AM, Karl Hiramoto k...@hiramoto.org wrote:
 Hi,

 I have a keyspace with  Replication Factor: 2
 and it seems though that most of my data goes to one node.


 What am I missing to have Cassandra balance more evenly?

 ./nodetool  -h host1 ring
 Address         Status State   Load            Owns
 Token

 82740373310283352874863875878673027619
 10.1.4.14     Up     Normal  17.45 GB        77.48%
 44427918469925720421829352515848570517
 10.1.4.12     Up     Normal  8.1 GB          8.12%
 58247356085106932369828800153350419939
 10.1.4.13     Up     Normal  49.51 KB        1.66%
 61078635599166706937511052402724559481
 10.1.4.15     Up     Normal  54.48 KB        6.37%
 71909504454725029906187464140698793550
 10.1.4.10     Up     Normal  44.38 KB        6.37%
 82740373310283352874863875878673027619


 I use phpcasa as a client and it should randomly choose a host to
 connect to.

 --
 Karl


For a 5 node cluster your initial Tokens should be:

tokens=5 ant -DclassToRun=hpcas.c01.InitialTokens run
run:
 [java] 0
 [java] 34028236692093846346337460743176821145
 [java] 68056473384187692692674921486353642290
 [java] 102084710076281539039012382229530463435
 [java] 136112946768375385385349842972707284580

To see how these numbers were calculated :
http://wiki.apache.org/cassandra/Operations#Token_selection

Use nodetool move and nodetool cleanup to correct the imbalance of your cluster.
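
For reference, a minimal Java sketch of the same calculation
(initial_token(i) = i * 2**127 / N for RandomPartitioner):

import java.math.BigInteger;

public class InitialTokens {
    public static void main(String[] args) {
        int nodes = 5;
        BigInteger ringSize = BigInteger.valueOf(2).pow(127);
        for (int i = 0; i < nodes; i++) {
            // i-th evenly spaced token on the 2**127 ring
            System.out.println(ringSize.divide(BigInteger.valueOf(nodes))
                                       .multiply(BigInteger.valueOf(i)));
        }
    }
}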


Re: balancing load

2011-01-17 Thread Edward Capriolo
On Mon, Jan 17, 2011 at 2:44 AM, aaron morton aa...@thelastpickle.com wrote:
 The nodes will not automatically delete stale data, to do that you need to 
 run nodetool cleanup.

 See step 3 in the Range Changes  Bootstrap 
 http://wiki.apache.org/cassandra/Operations#Range_changes

 If you are feeling paranoid before hand, you could run nodetool repair on 
 each node in turn to make sure they have the correct data. 
 http://wiki.apache.org/cassandra/Operations#Repairing_missing_or_inconsistent_data

 You may also have some tombstones in there, they will not be deleted until 
 after GCGraceSeconds
 http://wiki.apache.org/cassandra/DistributedDeletes

 Hope that helps.
 Aaron

 On 17 Jan 2011, at 20:34, Karl Hiramoto wrote:

 Thanks for the help.  I used nodetool move, so now each node owns 20%
 of the space, but it seems that the data load is still mostly on 2 nodes.


 nodetool  --host slave4 ring
 Address         Status State   Load            Owns
 Token

      136112946768375385385349842972707284580
 10.1.4.10     Up     Normal  335.9 MB        20.00%
 0
 10.1.4.12     Up     Normal  54.42 KB        20.00%
 34028236692093846346337460743176821145
 10.1.4.13     Up     Normal  59.32 KB        20.00%
 68056473384187692692674921486353642290
 10.1.4.14     Up     Normal  6.33 GB         20.00%
 102084710076281539039012382229530463435
 10.1.4.15     Up     Normal  6.36 GB         20.00%
 136112946768375385385349842972707284580




 --
 Karl



Just to head off the next possible problem: if you run 'nodetool cleanup'
on each node and some of your nodes still have more data than others,
then it probably means you are writing the majority of data to a few
keys. (You probably do not want to do that.)

If that happens, you can use nodetool cfstats on each node and check
that the compacted row maximum size is roughly the same on all nodes. If
you have one or two really big rows that could explain your imbalance.


Re: balancing load

2011-01-17 Thread Edward Capriolo
On Mon, Jan 17, 2011 at 10:51 AM, Peter Schuller
peter.schul...@infidyne.com wrote:
 Just to head the next possible problem. If you run 'nodetool cleanup'
 on each node and some of your nodes still have more data then others,
 then it probably means your are writing the majority of data to a few
 keys. ( you probably do not want to do that )

 It may also be that a compact is needed if the discrepancies are
 within the variation expected during normal operation due to
 compaction (this assumes overwrites/deletions in write traffic).

 --
 / Peter Schuller


@Peter Isn't cleanup a special case of compaction? I.e., it works as a
major compaction plus removes data not belonging to the node?


Re: balancing load

2011-01-17 Thread Edward Capriolo
On Mon, Jan 17, 2011 at 1:20 PM, Karl Hiramoto k...@hiramoto.org wrote:
 On 01/17/11 15:54, Edward Capriolo wrote:
 Just to head the next possible problem. If you run 'nodetool cleanup'
 on each node and some of your nodes still have more data then others,
 then it probably means your are writing the majority of data to a few
 keys. ( you probably do not want to do that )

 If that happens, you can use nodetool cfstats on each node and ensure
 that the 'max row compacted size' is roughly the same on all nodes. If
 you have one or two really big rows that could explain your imbalance.


 Well, I did a lengthy repair/cleanup  on each node.  but still have data
 mainly on two nodes (I have RF=2)
  ./apache-cassandra-0.7.0/bin/nodetool --host host3 ring
 Address         Status State   Load            Owns
 Token

 119098828422328462212181112601118874007
 10.1.4.10     Up     Normal  347.11 MB       30.00%
 0
 10.1.4.12     Up     Normal  49.41 KB        20.00%
 34028236692093846346337460743176821145
 10.1.4.13     Up     Normal  54.35 KB        20.00%
 68056473384187692692674921486353642290
 10.1.4.15     Up     Normal  19.09 GB        16.21%
 95643579558861158157614918209686336369
 10.1.4.14     Up     Normal  15.62 GB        13.79%
 119098828422328462212181112601118874007


 in cfstats i see:
 Compacted row minimum size: 1131752
 Compacted row maximum size: 8582860529
 Compacted row mean size: 1402350749

 on the lowest used node i see:
 Compacted row minimum size: 0
 Compacted row maximum size: 0
 Compacted row mean size: 0

 I basicly have  MyKeyspace.Offer[UID] = value    my value  is at most
 500 bytes.

 UID i just use a 12 random alpha numeric values  [A-Z][0-9]

 Should i try and adjust my tokens to fix the imbalance or something else?

 I'm using Redhat EL  5.5

 java -version
 java version 1.6.0_17
 OpenJDK Runtime Environment (IcedTea6 1.7.5) (rhel-1.16.b17.el5-x86_64)
 OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)

 I have some errors in my logs:

 ERROR [ReadStage:1747] 2011-01-17 18:13:53,988
 DebuggableThreadPoolExecutor.java (line 103) Error in ThreadPoolExecutor
 java.lang.AssertionError
        at
 org.apache.cassandra.db.columniterator.SSTableNamesIterator.readIndexedColumns(SSTableNamesIterator.java:178)
        at
 org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:132)
        at
 org.apache.cassandra.db.columniterator.SSTableNamesIterator.init(SSTableNamesIterator.java:70)
        at
 org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59)
        at
 org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80)
        at
 org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1215)
        at
 org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1107)
        at
 org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1077)
        at org.apache.cassandra.db.Table.getRow(Table.java:384)
        at
 org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:60)
        at
 org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:68)
        at
 org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
        at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:636)
 ERROR [ReadStage:1747] 2011-01-17 18:13:53,989
 AbstractCassandraDaemon.java (line 91) Fatal exception in thread
 Thread[ReadStage:1747,5,main]
 java.lang.AssertionError
        at
 org.apache.cassandra.db.columniterator.SSTableNamesIterator.readIndexedColumns(SSTableNamesIterator.java:178)
        at
 org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:132)
        at
 org.apache.cassandra.db.columniterator.SSTableNamesIterator.init(SSTableNamesIterator.java:70)
        at
 org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59)
        at
 org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80)
        at
 org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1215)
        at
 org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1107)
        at
 org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1077)
        at org.apache.cassandra.db.Table.getRow(Table.java:384)
        at
 org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:60)
        at
 org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:68)
        at
 org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
        at
 java.util.concurrent.ThreadPoolExecutor.runWorker

Re: changing the replication level on the fly

2011-01-18 Thread Edward Capriolo
On Tue, Jan 18, 2011 at 2:14 PM, Jeremy Stribling st...@nicira.com wrote:
 Hi,

 I've noticed in the new Cassandra 0.7.0 release that if I have a keyspace
 with a replication level of 2, but only one Cassandra node, I cannot insert
 anything into the system.  Likely this was a bug in the old release I was
 using (0.6.8 -- is there a JIRA describing this problem?).  However, this is
 a problem for our application, as we don't want to have to predefine the
 number of nodes, but rather start with one node, and add nodes as needed.

 Ideally, we could start our system with one node, and be able to insert data
 just on that one node.  Then, when a second node is added, we can start
 using that node to store replicas for the keyspace.  I know that 0.7.0 has a
 new operation for updating keyspace properties like replication level, but
 in the documentation there is some mention about having to run manual repair
 operations after using it.  My question is: what happens if we do not run
 these repair operations?

 Here's what I'd like to do:
 1) Start with a single node with autobootstrap=false and replication
 level=1.
 2) Later, start a second node with autobootstrap=true and join it to the
 first.
 3) The application detects that there are now two nodes, and issues the
 command to pump up the replication level to 2.
 4) If it ever drops back down to one node, it will turn the replication
 level down again.

 If we do not do a repair, will all hell break lose, or will it just be the
 case that data inserted when there was only one node will continue to be
 unreplicated, but data inserted when there were two nodes will have two
 replicas?  Thanks,

 Jeremy



If you up your replication Factor and do not repair this is what happens:

READ.QUORUM - This is safe. Over time all entries that are read will
be fixed through read repair. Reads will return correct data.
BUT data never read will never be copied to the new node.
READ.ONE - 50% of your reads will return correct data. 50% of your
Reads will return NO data the first time (based on the server your
read hits). Then they will be read repaired. Second read will return
the correct data.

You can extrapolate the complications caused by this if you add 10
or 15 nodes over time. You are never really sure if the data from the
first node got replicated to the second, did the second get replicated
to the third? Brain hurting... CAP is complicated enough...


Re: please help with multiget

2011-01-18 Thread Edward Capriolo
On Tue, Jan 18, 2011 at 4:29 PM, Shu Zhang szh...@mediosystems.com wrote:
 Well, I don't think what I'm describing is complicated semantics. I think 
 I've described general batch operation design and something that is 
 symmetrical the batch_mutate method already on the Cassandra API. You are 
 right, I can solve the problem with further denormalization, and the approach 
 of making individual gets in parallel as described by Brandon will work too. 
 I'll be doing one of these for now. But I think neither is as efficient, and 
 I guess I'm still not sure why the multiget is designed the way it is.

 The problem with denormalization is you gotta make multiple row writes in 
 place of one, adding load to the server, adding required physical space and 
 losing atomicity on write operations. I know writes are cheap in cassandra, 
 and you can catch failed writes and retry so these problems are not major, 
 but it still seems clear that having a batch-get that works appropriately is 
 a least a little better...
 
 From: Aaron Morton [aa...@thelastpickle.com]
 Sent: Tuesday, January 18, 2011 12:55 PM
 To: user@cassandra.apache.org
 Subject: Re: please help with multiget

 I think the general approach is to denormalise data to remove the need for 
 complicated semantics when reading.

 Aaron

 On 19/01/2011, at 7:57 AM, Shu Zhang szh...@mediosystems.com wrote:

 Well, maybe making a batch-get is not  anymore efficient on the server side 
 but without it, you can get bottlenecked on client-server connections and 
 client resources. If the number of requests you want to batch is on the 
 order of connections in your pool, then yes, making gets in parallel is as 
 good or maybe better. But what if you want to batch thousands of requests?

 The server I can scale out, I would want to get my requests there without 
 needing to wait for connections on my client to free up.

 I just don't really understand the reasoning for designing muliget_slice the 
 way it is. I still think if you're gonna have a batch-get request 
 (multiget_slice), you should be able to add to the batch a reasonable number 
 of ANY corresponding non-batch get requests. And you can't do that... Plus, 
 it's not symmetrical to the batch-mutate. Is there a good reason for that?
 
 From: Brandon Williams [dri...@gmail.com]
 Sent: Monday, January 17, 2011 5:09 PM
 To: user@cassandra.apache.org
 Cc: hector-us...@googlegroups.com
 Subject: Re: please help with multiget

 On Mon, Jan 17, 2011 at 6:53 PM, Shu Zhang 
 szh...@mediosystems.commailto:szh...@mediosystems.com wrote:
 Here's the method declaration for quick reference:
 mapstring,listColumnOrSuperColumn multiget_slice(string keyspace, 
 liststring keys, ColumnParent column_parent, SlicePredicate predicate, 
 ConsistencyLevel consistency_level)

 It looks like you must have the same SlicePredicate for every key in your 
 batch retrieval, so what are you suppose to do when you need to retrieve 
 different columns for different keys?

 Issue multiple gets in parallel yourself.  Keep in mind that multiget is not 
 an optimization, in fact, it can work against you when one key exceeds the 
 rpc timeout, because you get nothing back.

 -Brandon


multiget_slice is very useful IMHO. In my testing, the round-trip time
for 1000 get requests all being acked individually is much higher than
the round-trip time for 200 multiget_slice calls grouped 5 keys at a
time. For anyone that needs that type of access they are in good shape.

I was also theorizing that a CF using RowCache with very, very high
read rate would benefit from pooling a bunch of reads together with
multiget.

I do agree that the first time I looked at the multiget_slice
signature I realized I could not do many of the things I was expecting
from a multi-get.
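
To make the grouping concrete, a minimal sketch (the SliceFetcher
interface is a hypothetical stand-in for whatever client call you use for
multiget_slice, not a real Cassandra or Hector API):

import java.util.ArrayList;
import java.util.List;

public class BatchedReads {
    // Hypothetical wrapper around your client's multiget_slice call.
    interface SliceFetcher {
        void fetchSlice(List<String> keys);
    }

    // Issue one multiget per group of batchSize keys instead of one get per
    // key, e.g. 1000 keys with batchSize=5 becomes 200 round trips.
    static void readInBatches(SliceFetcher client, List<String> keys, int batchSize) {
        for (int i = 0; i < keys.size(); i += batchSize) {
            List<String> group = new ArrayList<String>(
                    keys.subList(i, Math.min(i + batchSize, keys.size())));
            client.fetchSlice(group);
        }
    }
}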


Re: Cassandra on iSCSI?

2011-01-21 Thread Edward Capriolo
On Fri, Jan 21, 2011 at 12:07 PM, Jonathan Ellis jbel...@gmail.com wrote:
 On Fri, Jan 21, 2011 at 2:19 AM, Mick Semb Wever m...@apache.org wrote:

 Of course with a SAN you'd want RF=1 since it's replicating
 internally.

 Isn't this the same case for raid-5 as well?

 No, because the replication is (mainly) to protect you from machine
 failures; if the SAN is a SPOF then putting more replicas on it
 doesn't help.

 And we want RF=2 if we need to keep reading while doing rolling
 restarts?

 Yes.

 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com


If you are using Cassandra with a SAN, RF=1 makes sense because we are
making the assumption the SAN is already replicating your data. RF=2
makes good sense to not be affected by outages. Another alternative is
something like Linux-HA, managing each Cassandra instance as a
resource. This way if a head goes down, Linux-HA would detect the
failure and bring up that instance on another physical piece of
hardware.

Using Linux-HA + SAN + Cassandra would actually bring Cassandra closer
to the HBase model, in which you have a distributed file system but the
front-end Cassandra acts like a region server.


Re: Lost MUTATIONS on several Cassandra nodes - no impact on the client

2011-01-23 Thread Edward Capriolo
On Sun, Jan 23, 2011 at 6:30 AM, ruslan usifov ruslan.usi...@gmail.com wrote:


 2011/1/20 Jonathan Ellis jbel...@gmail.com

 It guarantees that if the requested ConsistencyLevel is not achieved,
 client will get a TimedOutException, which is a signal you need to add
 capacity to handle what you are throwing at the cluster.

 Sorry, and when is UnavailableException thrown? When data can't be saved
 anywhere?


Right. The difference is that the gossip process builds a topology of
UP/DOWN hosts, so Unavailable is thrown quickly. If you need ALL and
one replica is known down - Unavailable.

However if the coordinator believed the node was UP and the request
took longer than the RPC timeout (default 10,000 ms) -
TimedOutException


Re: Lost MUTATIONS on several Cassandra nodes - no impact on the client

2011-01-23 Thread Edward Capriolo
On Sun, Jan 23, 2011 at 11:23 AM, ruslan usifov ruslan.usi...@gmail.com wrote:



 On Sun, Jan 23, 2011 at 6:30 AM, ruslan usifov ruslan.usi...@gmail.com
 wrote:

 Right. The difference is that the gossip process builds a topology of
 UP/DOWN hosts so Unavailable is thrown quickly. If you need ALL and
 one replica is known down - Unavailable.


 Is it possible to detect that the write didn't happen anywhere? Or is it
 only possible to detect a consistency failure?


Regardless of what Exception is thrown you should retry the write from
your client.

If the method threw UnavailableException the write operation did not
happen on any node, as the coordinator judged that it would not have
succeeded.

If the method threw TimedOutException the write could have succeeded
on some nodes but it was not acknowledged on enough to meet the CL
your requested.

It would be nice if the exception could be populated with more information, such as:
TimedOutException:
  requested: 3
  succeeded: 1
  succeededList: 127.0.0.1
  requestedList: 127.0.0.1,127.0.0.2,127.0.0.3

This would make the explanations more self explanatory, and would give
more transparency to the clients.

(knowing my luck thrift probably does not allow complex exception types)
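
A rough sketch of that retry logic, assuming the Thrift-generated exception
classes and a hypothetical doWrite() standing in for your client's
insert/batch_mutate call:

import org.apache.cassandra.thrift.TimedOutException;
import org.apache.cassandra.thrift.UnavailableException;

public class WriteRetry {
    interface Write {
        void doWrite() throws Exception; // hypothetical wrapper around insert/batch_mutate
    }

    static void writeWithRetry(Write w, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                w.doWrite();
                return;
            } catch (UnavailableException e) {
                // Coordinator knew the consistency level could not be met: nothing was written.
            } catch (TimedOutException e) {
                // Write may have landed on some replicas but was not acked at the requested CL.
            }
            if (attempt >= maxAttempts) {
                throw new Exception("write failed after " + maxAttempts + " attempts");
            }
            Thread.sleep(100L * attempt); // simple linear backoff before retrying
        }
    }
}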


Re: Cassandra + Puppet

2011-01-24 Thread Edward Capriolo
On Mon, Jan 24, 2011 at 5:17 PM, Nate McCall n...@riptano.com wrote:
 Might be a bit out of date, but this one is useful:
 https://github.com/cmceniry/cassandrapuppet



 On Mon, Jan 24, 2011 at 3:51 PM, Aaron Morton aa...@thelastpickle.com wrote:
 Is anyone using puppet http://www.puppetlabs.com/ to deploy / manage
 cassandra  ?
 Has anyone used this
 module https://github.com/plathrop/puppet-module-cassandra or using it for
 backups http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/cassandra_backup_is_a_snap or know
 of any other resources ?
 Cheers
 Aaron



I notice you found my blog. This article is much more detailed.

http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/easy_street_deploying_cassandra_via


Re: cassandra as session store

2011-02-01 Thread Edward Capriolo
On Tue, Feb 1, 2011 at 12:57 PM, Anthony John chirayit...@gmail.com wrote:
 Not a concern - and here is why:-
 From the wiki arch section captioned below - eventual consistency does not
 have to mean inconsistent reads. The concern is the overhead for consistent
 reads. But remember in the use case being cited, the expensive read will
 happen only during failover, not all the time.

 More specifically: R=read replica count W=write replica count N=replication
 factor Q=QUORUM (Q = N / 2 + 1)

 If W + R > N, you will have consistency

 W=1, R=N
 W=N, R=1
 W=Q, R=Q where Q = N / 2 + 1

 On Tue, Feb 1, 2011 at 11:47 AM, Tong Zhu tong@rms.com wrote:

 The problem is where to store the session data. If the session need to be
 accessible by more than one web servers, the external storage is needed.

 Cassandra only supports eventual consistency. If web server w1 saves the
 session at node 1 of cassendra while web server w2 retrieve the session from
 different node, if these two requests are close enough, there is a chance
 what w2 retrieved is different from what w1 saved. Is it a concern?

 Tong



 -Original Message-
 From: buddhasystem [mailto:potek...@bnl.gov]
 Sent: Tuesday, February 01, 2011 9:42 AM
 To: cassandra-u...@incubator.apache.org
 Subject: Re: cassandra as session store


 Most if not all modern web application frameworks support sessions. This
 applies to Django (with which I have most experience and also run it with
 X.509 security layer) but also to Ruby on Rails and Pylons.

 So, why would you re-invent the wheel? Too messy. It's all out there for
 you
 to use.

 Regards,
 Maxim

 --
 View this message in context:
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/cassandra-as-session-store-tp5981871p5981961.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at
 Nabble.com.


 This message and any attachments contain information that may be RMS Inc.
 confidential and/or privileged.  If you are not the intended recipient (or
 authorized to receive for the intended recipient), and have received this
 message in error, any use, disclosure or distribution is strictly
 prohibited.   If you have received this message in error, please notify the
 sender immediately by replying to the e-mail and permanently deleting the
 message from your computer and/or storage system.



Ah. Eventual Consistency! Mama no! RUN!

From:
Download JSR-000315 Java Servlet 3.0 Final Release for Documentation, English

Distributed Environments

Within an application marked as distributable, all requests that are
part of a session
must be handled by one JVM at a time. The container must be able to handle all
objects placed into instances of the HttpSession class using the setAttribute or
putValue methods appropriately. The following restrictions are imposed to meet
these conditions:

This looks to be the responsibility of the web cluster, to ensure
serialized access, not the backend. (At least that's how I am reading it.)


Re: How to delete bulk data from cassandra 0.6.3

2011-02-05 Thread Edward Capriolo
On Sat, Feb 5, 2011 at 4:12 AM, Ali Ahsan ali.ah...@panasiangroup.com wrote:
 Any update on this?

 On 02/05/2011 12:53 AM, Ali Ahsan wrote:

 So do we need to write a script ? or its some thing i can do as a system
 admin without involving and developer.If yes please guide me in this case.




 On 02/04/2011 10:36 PM, Jonathan Ellis wrote:

 In that case, you should shut down the server before removing data files.

 On Fri, Feb 4, 2011 at 9:01 AM,roshandawr...@gmail.com  wrote:

 I thought truncate() was not available before 0.7 (in 0.6.3)was it?

 ---
 Sent from BlackBerry

 -Original Message-
 From: Jonathan Ellisjbel...@gmail.com
 Date: Fri, 4 Feb 2011 08:58:35
 To: useruser@cassandra.apache.org
 Reply-To: user@cassandra.apache.org
 Subject: Re: How to delete bulk data from cassandra 0.6.3

 You should use truncate instead. (Then remove the snapshot truncate
 creates.)

 On Fri, Feb 4, 2011 at 2:05 AM, Ali Ahsanali.ah...@panasiangroup.com
  wrote:

 Hi All

 Is there any way i can delete column families data (not removing column
 families ) from Cassandra without effecting ring integrity.What if  i
 delete
 some column families data in linux with rm command  ?

 --
 S.Ali Ahsan

 Senior System Engineer


 e-Business (Pvt) Ltd

 49-C Jail Road, Lahore, P.O. Box 676
 Lahore 54000, Pakistan

 Tel: +92 (0)42 3758 7140 Ext. 128

 Mobile: +92 (0)345 831 8769

 Fax: +92 (0)42 3758 0027

 Email: ali.ah...@panasiangroup.com



 www.ebusiness-pg.com

 www.panasiangroup.com





 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com







 --
 S.Ali Ahsan

 Senior System Engineer

 e-Business (Pvt) Ltd

 49-C Jail Road, Lahore, P.O. Box 676
 Lahore 54000, Pakistan

 Tel: +92 (0)42 3758 7140 Ext. 128

 Mobile: +92 (0)345 831 8769

 Fax: +92 (0)42 3758 0027

 Email: ali.ah...@panasiangroup.com



 www.ebusiness-pg.com

 www.panasiangroup.com




in 0.6.X:
kill <pid of cassandra>
rm -rf /var/lib/cassandra/data/<keyspace>/<CF you want to delete>*
(start cassandra)


Re: How to delete bulk data from cassandra 0.6.3

2011-02-05 Thread Edward Capriolo
On Sat, Feb 5, 2011 at 11:35 AM, Ali Ahsan ali.ah...@panasiangroup.com wrote:
 Thanks for replying, Edward Capriolo. Will this affect Cassandra ring
 integrity? Another question: will Cassandra work properly after this
 operation? And will it be possible to restore the deleted data from backup?

 in 0.6.X:
 kill <pid of cassandra>
 rm -rf /var/lib/cassandra/data/<keyspace>/<CF you want to delete>*
 (start cassandra)




 --
 S.Ali Ahsan

 Senior System Engineer

 e-Business (Pvt) Ltd

 49-C Jail Road, Lahore, P.O. Box 676
 Lahore 54000, Pakistan

 Tel: +92 (0)42 3758 7140 Ext. 128

 Mobile: +92 (0)345 831 8769

 Fax: +92 (0)42 3758 0027

 Email: ali.ah...@panasiangroup.com



 www.ebusiness-pg.com

 www.panasiangroup.com




I am not sure what you mean by data integrity.

In short, when Cassandra starts up it searches its data directories
and loads up the data files, indexes, bloom filters, and saved caches it
finds.

Unless the files are corrupt it will happily load up what it finds.

Restores are done by the process you described: stop the server, restore
the files, start the server.


Re: How bad is teh impact of compaction on performance?

2011-02-05 Thread Edward Capriolo
On Sat, Feb 5, 2011 at 11:59 AM, buddhasystem potek...@bnl.gov wrote:

 Just wanted to see if someone with experience in running an actual service
 can advise me:

 how often do you run nodetool compact on your nodes? Do you stagger it in
 time, for each node? How badly is performance affected?

 I know this all seems too generic but then again no two clusters are created
 equal anyhow. Just wanted to get a feel.

 Thanks,
 Maxim

 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-bad-is-teh-impact-of-compaction-on-performance-tp5995868p5995868.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.


This is an interesting topic. Cassandra can now remove tombstones on
non-major compactions. For some use cases you may not have to trigger
nodetool compact yourself to remove tombstones. Use cases that do not
do many updates or deletes may have the least need to run compactions
manually.

!However! If you have smaller SSTables, or fewer SSTables, your read
operations will be more efficient.

If you have downtime, such as from 1 AM to 6 AM, going through a major
compaction might shrink your dataset significantly, and that will make
reads better.

Compaction can be more or less intensive. The largest factor is row
size. Users with large rows probably see faster compaction, while
smaller rows see it take a long time. You can lower the priority of
the compaction thread for experimentation.

As to the performance you want to get your cluster to the state where
it is not compacting often. This may mean you need more nodes to
handle writes.

I graph the compaction information from JMX
http://www.jointhegrid.com/cassandra/cassandra-cacti-m6.jsp
to get a feel for how often a node is compacting on average. Also I
cross reference the compaction with Read latency and IO graphs I have
to see what impact compaction has on reads.

Forcing a major compaction also lowers the chances a compaction will
happen during the day on peak time. I major compact a few cluster
nodes each night through cron (gc time 3 days). This has been good for
keeping our data on disk as small as possible. Forcing the major
compact at night uses IO, but i find it saves IO over the course of
the day because each read seeks less on disk.
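
As a sketch (the install path and times here are assumptions; stagger them
so only one node compacts at a time), a crontab entry per node along the
lines of:

0 2 * * * /opt/cassandra/bin/nodetool -h localhost compact

with 0 4 * * * on the next node, and so on.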


Re: How bad is teh impact of compaction on performance?

2011-02-05 Thread Edward Capriolo
On Sat, Feb 5, 2011 at 12:48 PM, buddhasystem potek...@bnl.gov wrote:

 Thanks Edward. In our usage scenario, there is never downtime, it's a global
 24/7 operation.

 What is impacted the worst, the read or write?

 How does a node handle compaction when there is a spike of writes coming to
 it?



 Edward Capriolo wrote:

 On Sat, Feb 5, 2011 at 11:59 AM, buddhasystem potek...@bnl.gov wrote:

 Just wanted to see if someone with experience in running an actual
 service
 can advise me:

 how often do you run nodetool compact on your nodes? Do you stagger it in
 time, for each node? How badly is performance affected?

 I know this all seems too generic but then again no two clusters are
 created
 equal anyhow. Just wanted to get a feel.

 Thanks,
 Maxim

 --
 View this message in context:
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-bad-is-teh-impact-of-compaction-on-performance-tp5995868p5995868.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at
 Nabble.com.


 This is an interesting topic. Cassandra can now remove tombstones on
 non-major compaction. For some use cases you may not have to trigger
 nodetool compact yourself to remove tombstones. Use cases that do not
 to many updates, deletes may have the least need to run compaction
 yourself.

 !However! If you have smaller SSTables, or less SSTables your read
 operations will be more efficient.

 if you have downtime such as from 1AM-6AM. Going through a major
 compaction might shrink you dataset significantly and that will make
 reads better.

 Compaction can be more or less intensive. The largest factor is is row
 size.  Users with large rows probably see faster compaction while
 smaller rows see it take a long time. You can lower the priority of
 the compaction thread for experimentation.

 As to the performance you want to get your cluster to the state where
 it is not compacting often. This may mean you need more nodes to
 handle writes.

 I graph the compaction information from JMX
 http://www.jointhegrid.com/cassandra/cassandra-cacti-m6.jsp
 to get a feel for how often a node is compacting on average. Also I
 cross reference the compaction with Read latency and IO graphs I have
 to see what impact compaction has on reads.

 Forcing a major compaction also lowers the chances a compaction will
 happen during the day on peak time. I major compact a few cluster
 nodes each night through cron (gc time 3 days). This has been good for
 keeping our data on disk as small as possible. Forcing the major
 compact at night uses IO, but i find it saves IO over the course of
 the day because each read seeks less on disk.



 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-bad-is-the-impact-of-compaction-on-performance-tp5995868p5995978.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.


It does not have to be downtime. It just has to be a slow time. Use
your traffic graphs to run a major compaction at the slowest time so it
has the least impact on performance.

Compaction does not generally affect writes or bursts of writes,
especially if your writes go to a separate commit log disk.

In the best-case scenario compaction may not affect your performance
at all. An example of this would be a use case where nearly 100% of
reads are serviced by the row cache, so disk is not a factor.

Generally speaking, if you have good fast hard disks, and only a single
node is compacting at a given time, the cluster absorbs this. In 0.7.0
the dynamic snitch should help re-route traffic away from slower nodes
for even less impact. In other words, making compaction non-impacting is
all about capacity.


Re: Finding the intersection results of column sets of two rows

2011-02-06 Thread Edward Capriolo
On Sun, Feb 6, 2011 at 10:15 AM, buddhasystem potek...@bnl.gov wrote:

 Hello,

 If the amount of data is _that_ small, you'll have a much easier life with
 MySQL, which supports the join procedure -- because that's exactly what
 you want to achieve.


 asil klin wrote:

 Hi all,

 I want to procure the intersection of columns set of two rows (from 2
 different column families).

 To achieve the intersection results, Can I, first retrieve all
 columns(around 300) from first row and just query by those column
 names in the second row(which contains maximum 100 000 columns) ?

 I am using the results during the write time  not before presentation
 to the user, so latency wont be much concern while writing.

 Is it the proper way to procure intersection results of two rows ?

 Would love to hear your comments..


 -

 Regards,
 Asil



 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Finding-the-intersection-results-of-column-sets-of-two-rows-tp5997248p5997743.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.


You can use multiget when fetching lists of already-known keys to
optimize your round-trip time.

Re: Cassandra memory consumption

2011-02-08 Thread Edward Capriolo
On Tue, Feb 8, 2011 at 4:56 PM, Victor Kabdebon
victor.kabde...@gmail.com wrote:
 I will do that in the future and I will post my results here ( I upgraded
 the server to debian 6 to see if there is any change, so memory is back to
 normal). I will report in a few days.
 In the meantime I am open to any suggestion...

 2011/2/8 Aaron Morton aa...@thelastpickle.com

 When you attach to the JVM with JConsole how much non heap memory and how
 much heap memory is reported on the memory tab?
 Xmx controls the total size of the heap memory, which excludes the
 permanent generation.
 see

 http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#generation_sizing
 and

 http://blogs.sun.com/jonthecollector/entry/presenting_the_permanent_generation
 Total non-heap memory on a 0.7 box I have is around 27M. You numbers seem
 large but it would be interesting to know what the JVM is reporting.
 Aaron
 On 09 Feb, 2011,at 05:57 AM, Victor Kabdebon victor.kabde...@gmail.com
 wrote:

 Information on the system :

 Debian 5
 Jvm :
 victor@testhost:~/database/apache-cassandra-0.6.6$ java -version
 java version 1.6.0_22
 Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
 Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)

 RAM : 2Go


 2011/2/8 Victor Kabdebon victor.kabde...@gmail.com

 Sorry Jonathan :

 So most of these informations were taken using the command :

 sudo ps aux | grep cassandra

 For the nodetool information it is :

 /bin/nodetool --host localhost --port 8081 info


 Regars,

 Victor K.


 2011/2/8 Jonathan Ellis jbel...@gmail.com

 I missed the part where you explained where you're getting your numbers
 from.


 On Tue, Feb 8, 2011 at 9:32 AM, Victor Kabdebon
 victor.kabde...@gmail.com wrote:
  It is really weird that I am the only one to have this issue.
  I restarted Cassandra today and already the memory compution is over
  the
  limit :
 
  root  1739  4.0 24.5 664968 494996 pts/4   SLl  15:51   0:12
  /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC
  -XX:+UseConcMarkSweepGC
  -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
  -XX:MaxTenuringThreshold=1
  -XX:CMSInitiatingOccupancyFraction=75
  -XX:+UseCMSInitiatingOccupancyOnly
  -XX:+HeapDumpOnOutOfMemoryError
  -Dcom.sun.management.jmxremote.port=8081
  -Dcom.sun.management.jmxremotessl=false
  -Dcom.sun.management.jmxremote.authenticate=false
  -Dstorage-config=bin/../conf -cp
 
  bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-06.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/./lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar
  org.apache.cassandra.thrift.CassandraDaemon
 
  It is really an annoying problem if we cannot really foresee memory
  consumption.
 
  Best regards,
  Victor K
 
  2011/2/8 Victor Kabdebon victor.kabde...@gmail.com
 
  Dear all,
 
  Sorry to come back again to this point but I am really worried about
  Cassandra memory consumption. I have a single machine that runs one
  Cassandra server. There is almost no data on it but I see a crazy
  memory
  consumption and it doesn't care at all about the instructions...
  Note that I am not using mmap, but Standard, I use also JNA (inside
  lib
  folder), i am running on debian 5 64 bits, so a pretty normal
  configuration.
  I also use Cassandra 0.6.8.
 
 
  Here are the informations I gathered on Cassandra :
 
  105  16765  0.1 34.1 1089424 687476 ?  Sl   Feb02  14:58I
  think you are
  /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC
  -XX:+UseConcMarkSweepGC
  -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
  -XX:MaxTenuringThreshold=1
  -XX:CMSInitiatingOccupancyFraction=75
  -XX:+UseCMSInitiatingOccupancyOnly
  -XX:+HeapDumpOnOutOfMemoryError
  -Dcom.sunmanagement.jmxremote.port=8081
  -Dcom.sun.management.jmxremote.ssl=false
  -Dcom.sun.management.jmxremote.authenticate=false
  -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp
 
  

Re: Specifying row caching on per query basis ?

2011-02-09 Thread Edward Capriolo
On Wed, Feb 9, 2011 at 2:43 PM, Ertio Lew ertio...@gmail.com wrote:
 Is this under consideration for future releases ? or being thought about!?



 On Thu, Feb 10, 2011 at 12:56 AM, Jonathan Ellis jbel...@gmail.com wrote:
 Currently there is not.

 On Wed, Feb 9, 2011 at 12:04 PM, Ertio Lew ertio...@gmail.com wrote:
 Is there any way to specify on per query basis(like we specify the
 Consistency level), what rows be cached while you're reading them,
 from a row_cache enabled CF. I believe, this could lead to much more
 efficient use of the cache space!!( if you use same data for different
 features/ parts in your application which have different caching
 needs).




 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com



I have mentioned a suggested implementation inside this issue.

https://issues.apache.org/jira/browse/CASSANDRA-2035


Re: Default Listen Port

2011-02-09 Thread Edward Capriolo
On Wed, Feb 9, 2011 at 4:00 PM,  jeremy.truel...@barclayscapital.com wrote:
 What’s the easiest way to change the port nodes listen for comm on from
 other nodes? It appears that the default is 8080 which collides with my
 tomcat server on one of our dev boxes. I tried doing something in
 cassandra.yaml like



 listen_address: 192.1.fake.2:



 but that doesn’t work it throws an exception. Also can you not put the
 actual name of servers in the config or does it always have to be the actual
 ip address currently? Thanks.



 jt






You are having a collision on port 8080, which is the default JMX port.

In conf/cassandra-env.sh
look for JMX_PORT=8080

9160 is the thrift port used by clients
7000 is the storage port (used between nodes)

If you change the JMX port you have to specify it when using nodetool:
'nodetool -h localhost -p <new port> ring'
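Concretely, a minimal sketch assuming the stock 0.7 conf layout (8081 is just
a placeholder port):

    # conf/cassandra-env.sh
    JMX_PORT="8081"

    # nodetool must then be pointed at the new port
    nodetool -h localhost -p 8081 ring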


Re: Is Avro still supported?

2011-02-12 Thread Edward Capriolo
https://issues.apache.org/jira/browse/CASSANDRA-926

On Sat, Feb 12, 2011 at 8:27 AM, Joshua Partogi joshua.j...@gmail.com wrote:
 Hi,

 I saw in the latest source in trunk, avro codes has been deleted. Does
 this mean Avro is not supported anymore? If so, what was the decision
 behind dropping the support for Avro?

 Thanks

 --
 http://twitter.com/jpartogi



Re: Does Cassandra support multiple listen_address and rpc_address?

2011-02-13 Thread Edward Capriolo
On Sun, Feb 13, 2011 at 1:39 AM, Xiaobo Gu guxiaobo1...@gmail.com wrote:
 multiple network paths for inner-cluster communication will boost performance

 Thanks.

 Xiaobo Gu


No. Each node has a single IP. You can boost performance in a similar
way with Ethernet bonding or 10GbE.


Re: consistency question

2011-02-15 Thread Edward Capriolo
On Tue, Feb 15, 2011 at 3:59 AM, Serdar Irmak sir...@protel.com.tr wrote:

 Hi,



 In a 3 node named (named A,B,C) setup with replication factor 3 and quorum
 read/write scenario;

 suppose a new value of data X is written to A and B but not C for some
 reason, then A went down and I started D with the data of C, or with empty
 data, so that either way X is not present on D.

 Then when I read at quorum, nodes C and D responded and gave me the old
 value (with read repair happening in the background). So doesn't it mean
 there is no consistency with quorum, too?





 My best

 Serdar




The consistency rules do NOT apply if you introduce a new node without
properly bootstrapping it. If you have A, B, C and A fails you should 1)
'nodetool removetoken A'. 2) Start node D with auto_bootstrap=true.

You can start a node empty (with bootstrap=false) using quorum/quorum, but
if you do not 'nodetool repair' it before another node fails you end up with
the situation you described.
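A rough sketch of that sequence, assuming 0.7-era nodetool and cassandra.yaml
(hostnames and the token are placeholders):

    # 1) on any live node, remove the failed node's token
    nodetool -h live-node removetoken <token-of-A>

    # 2) on replacement node D, in cassandra.yaml
    auto_bootstrap: true
    initial_token:      # leave empty, or optionally A's now-freed token

    # 3) once D has joined, repair before trusting quorum again
    nodetool -h node-D repair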

Edward


Re: What is the most solid version of Cassandra? No secondary indexes needed.

2011-02-15 Thread Edward Capriolo
On Tue, Feb 15, 2011 at 3:03 PM, buddhasystem potek...@bnl.gov wrote:

 Thank you! It's just that 7.1 seems the bleeding edge now (a serious bug
 fixed today). Would you still trust it as a production-level service? I'm
 just slightly concerned. I don't want to create a perception among our IT
 that the product is not ready for prime time.
 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-is-the-most-solid-version-of-Cassandra-No-secondary-indexes-needed-tp6028966p6029047.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.


You are not going to want to go through the 0.6.X API to 0.7.0 API
migration. I am still happily running 0.6.8, but I know I need the
features in 0.7.X. If I were starting today I would go with the 0.7.X
branch and be ready to do some minor updates in the next couple of
months.


Re: Replica details

2011-02-17 Thread Edward Capriolo
On Thu, Feb 17, 2011 at 1:41 PM, A J s5a...@gmail.com wrote:
 Where can I get good detailed explanation of the various replication
 options (Simple, Old Network and Network) along with snitches. I did
 read the definitive guide but not really satisfied.

 Is there a good post somewhere explaining this ?

 I will have 4 datacenters (assume) and 3 nodes in each DC. I wish to
 have one and only one copy of complete database in each DC. I wish to
 understand how will the ring placement look like.

 Thanks.
 AJ


I hate to break this to you, but the Definitive Guide probably has the
best information (including diagrams) out there that I know of.
Because of all the possible permutations of multi-datacenter setups it
is going to be difficult to find a doc/presentation that describes
EXACTLY what you want to do and how it will work. Here are some hints:
you can set up a simulation cluster. Give each host an IP such as
127.0.0.1, 127.0.0.2, etc., since you do not have to explicitly configure
those on a single host. Set the Xmx low for each instance. Run in the
foreground and/or set your logging to verbose so you can see which
nodes the data lands on.
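A minimal sketch of such a simulation setup, assuming a separate 0.7 install
per instance (each instance also needs its own data/commitlog directories and
its own JMX port):

    # instance 1 cassandra.yaml (instances 2 and 3 use 127.0.0.2 / 127.0.0.3)
    listen_address: 127.0.0.1
    rpc_address: 127.0.0.1

    # keep the heap small in each instance's cassandra-env.sh
    MAX_HEAP_SIZE="256M"
    HEAP_NEWSIZE="64M"

    # run in the foreground so you can watch where data lands
    bin/cassandra -f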


Re: Does servers with different capacities in a cluster affect the overall performance?

2011-02-22 Thread Edward Capriolo
On Tue, Feb 22, 2011 at 5:13 AM, XiaoboGu guxiaobo1...@gmail.com wrote:
 I mean servers with different CPU cores ,memory, or disk space, does
 Cassandra allow this kind of configuration?

This is allowed, but managing it may be more difficult in production.
Most settings, such as memtable_flush_mb, are applied cluster-wide at the
column family level. This means you will never be able to get the tuning
perfect, because you will always have to take a middle-ground approach.

Moreover, the Random partitioner works best when each node has an equal
share of data. An unbalanced ring is the enemy, because nodes with more
data see more requests, and each request has to work through more
data. Thus unbalanced nodes typically become the ones that start
showing performance issues first.

It also becomes really difficult to diagnose performance issues with
an increasing number of variables (this node has 2x the data but 4x the
RAM of node X, and 30% of the processing power of node Y).

Short of suggesting hardware, I'd hint that 1Us and blades are better
platforms than big iron, because scaling out is less difficult than
scaling up. Drastically mismatched hardware is something I would avoid.


Re: Distribution Factor: part of the solution to many-CF problem?

2011-02-22 Thread Edward Capriolo
On Mon, Feb 21, 2011 at 5:14 PM, David Boxenhorn da...@lookin2.com wrote:
 No, that's not what I mean at all.

 That message is about the ability to use different partitioners for
 different CFs, say, RandomPartitioner for one, OPP for another.

 I'm talking about defining how many nodes a CF should be distributed over,
 which would be useful if you have a lot of nodes and a lot of small CFs
 (small relative to the total amount of data).


 On Mon, Feb 21, 2011 at 9:58 PM, Aaron Morton aa...@thelastpickle.com
 wrote:

 Sounds a bit like this idea
 http://www.mail-archive.com/dev@cassandra.apache.org/msg01799.html

 Aaron

 On 22/02/2011, at 1:28 AM, David Boxenhorn da...@lookin2.com wrote:

  Cassandra is both distributed and replicated. We have Replication Factor
  but no Distribution Factor!
 
  Distribution Factor would define over how many nodes a CF should be
  distributed.
 
  Say you want to support millions of multi-tenant users in clusters with
  thousands of nodes, where you don't know the user's schema in advance, so
  you can't have users share CFs.
 
  In this case you wouldn't want to spread out each user's Column Families
  over thousands of nodes! You would want something like: RF=3, DF=10 i.e.
  distribute each CF over 10 nodes, within those nodes replicate 3 times.
 
  One implementation of DF would be to hash the CF name, and use the same
  strategies defined for RF to choose the N nodes in DF=N.
 



The single partitioner is baked in.

Here is a possible solution: use OPP, but md5 hash your keys client side.

This solves that, but when you have keyspaces using OPP with
different key distributions this falls apart.
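A tiny sketch of what hashing keys client side means; the key shown is
hypothetical, and any MD5 routine in your client library would do the same
job:

    # store the row under the hex digest of the natural key, not the key itself
    echo -n "user:12345" | md5sum
    # prints a 32-character hex digest; use that digest as the OPP row key

This gives the hashed keyspace RandomPartitioner-like balance, though as
noted above it breaks down once other keyspaces on the same ring use natural,
unevenly distributed keys.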


Re: Distribution Factor: part of the solution to many-CF problem?

2011-02-22 Thread Edward Capriolo
On Tue, Feb 22, 2011 at 2:49 PM, Aaron Morton aa...@thelastpickle.com wrote:
 The single partitioner is baked in
 That was my point.

 You could perhaps write a partitioner that considers the CF when deciding 
 what nodes to put data on. Off the top of my head the partitioner is not told 
 about the  CF the key is storing in.

 Aaron

 On 23/02/2011, at 6:01 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

 On Mon, Feb 21, 2011 at 5:14 PM, David Boxenhorn da...@lookin2.com wrote:
 No, that's not what I mean at all.

 That message is about the ability to use different partitioners for
 different CFs, say, RandomPartitioner for one, OPP for another.

 I'm talking about defining how many nodes a CF should be distributed over,
 which would be useful if you have a lot of nodes and a lot of small CFs
 (small relative to the total amount of data).


 On Mon, Feb 21, 2011 at 9:58 PM, Aaron Morton aa...@thelastpickle.com
 wrote:

 Sounds a bit like this idea
 http://www.mail-archive.com/dev@cassandra.apache.org/msg01799.html

 Aaron

 On 22/02/2011, at 1:28 AM, David Boxenhorn da...@lookin2.com wrote:

 Cassandra is both distributed and replicated. We have Replication Factor
 but no Distribution Factor!

 Distribution Factor would define over how many nodes a CF should be
 distributed.

 Say you want to support millions of multi-tenant users in clusters with
 thousands of nodes, where you don't know the user's schema in advance, so
 you can't have users share CFs.

 In this case you wouldn't want to spread out each user's Column Families
 over thousands of nodes! You would want something like: RF=3, DF=10 i.e.
 distribute each CF over 10 nodes, within those nodes replicate 3 times.

 One implementation of DF would be to hash the CF name, and use the same
 strategies defined for RF to choose the N nodes in DF=N.




 The single partitioner is baked in

 Here is a possible solution. Use OOP, but md5 hash your keys client side.

 This solves that, but when you have keyspaces using OOP but with
 different key distributions this falls apart.



Not to say that this is a bad idea, but it breaks the #1 law of
Cassandra: keep everything balanced. The routine that calculates
natural endpoints does not take the CF into account.

Regarding multi-tenancy, I do not think there is a line in the sand
between running N clusters and multi-tenancy.

Multi-tenancy is also ambiguous, like real time. Does multi-tenancy
mean efficiently supporting 10-20 CFs or 20,000? I do not see the
Cassandra code base supporting a very large number of CFs, since it
was designed around a low number of CFs.

Some people who have moved from an RDBMS background expect a table to
look/work like a column family, but that is probably not denormalized
enough. Many in fact advocate you only need 1 CF!


Re: Multiple Seeds

2011-02-23 Thread Edward Capriolo
On Wed, Feb 23, 2011 at 2:30 PM,  jeremy.truel...@barclayscapital.com wrote:
 Yeah I set the tokens, I’m more asking if I start the first seed node with
 autobootstrap set to false the second seed should have it set to true as
 well as all the slave nodes correct? I didn’t see this in the docs but I may
 have just missed it.



 From: Eric Gilmore [mailto:e...@datastax.com]
 Sent: Wednesday, February 23, 2011 2:24 PM
 To: user@cassandra.apache.org
 Subject: Re: Multiple Seeds



 The DataStax documentation offers some answers to those questions in the
 Getting Started section and the Clustering reference docs.

 Autobootstrap should be true, but with the important caveat that
 intial_token values should be specified.  Have a look at those docs, and
 please give feedback on how helpful they are/aren't.

 Regards,

 Eric Gilmore

 On Wed, Feb 23, 2011 at 11:15 AM, jeremy.truel...@barclayscapital.com
 wrote:

 What’s the best way to bring multiple seeds up, should only one of them have
 auto bootstrap set to true or should neither of them? Should they list
 themselves and the other seed in their seed section in the yaml config?




If a node is defined as a seed it will never auto bootstrap. After it
has bootstrapped and has a system table, you can list it as a seed in
its yaml file if you wish.


Re: Multiple Seeds

2011-02-23 Thread Edward Capriolo
On Wed, Feb 23, 2011 at 2:59 PM,  jeremy.truel...@barclayscapital.com wrote:
 To add a host to the seeds list after it has had the data streamed to it I
 need to



 1.   stop it

 2.   edit the yaml file to

 a.   include it in the seeds list

 b.  set auto boostrap to false

 3.    restart it



 correct? Additionally you would need to add it to the other nodes seed lists
 and restart them as well.



 From: Eric Gilmore [mailto:e...@datastax.com]
 Sent: Wednesday, February 23, 2011 2:47 PM
 To: user@cassandra.apache.org
 Subject: Re: Multiple Seeds



 Well -- when you first bring a node into a ring, you will probably want to
 stream data to it with auto_bootstrap: true.

 If you want that node to be a seed, then add it to the seeds list AFTER it
 has joined the ring.

 I'd refer you to the Seed List and Autoboostrapping sections of the
 Getting Started guide, which contain the following blurbs:

 There is no strict rule to determine which hosts need to be listed as seeds,
 but all nodes in a cluster need the same seed list. For a production
 deployment, DataStax recommends two seeds per data center.

 An autobootstrapping node cannot have itself in the list of seeds nor can it
 contain an initial_token already claimed by another node. To add new seeds,
 autobootstrap the nodes first, and then configure them as seeds.







 On Wed, Feb 23, 2011 at 11:39 AM, jeremy.truel...@barclayscapital.com
 wrote:

 So all seeds should always be set to 'auto_bootstrap: false' in their .yaml
 file.

 -Original Message-
 From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
 Sent: Wednesday, February 23, 2011 2:36 PM
 To: user@cassandra.apache.org

 Cc: Truelove, Jeremy: IT (NYK)
 Subject: Re: Multiple Seeds

 On Wed, Feb 23, 2011 at 2:30 PM,  jeremy.truel...@barclayscapital.com
 wrote:
 Yeah I set the tokens, I'm more asking if I start the first seed node with
 autobootstrap set to false the second seed should have it set to true as
 well as all the slave nodes correct? I didn't see this in the docs but I
 may
 have just missed it.



 From: Eric Gilmore [mailto:e...@datastax.com]
 Sent: Wednesday, February 23, 2011 2:24 PM
 To: user@cassandra.apache.org
 Subject: Re: Multiple Seeds



 The DataStax documentation offers some answers to those questions in the
 Getting Started section and the Clustering reference docs.

 Autobootstrap should be true, but with the important caveat that
 intial_token values should be specified.  Have a look at those docs, and
 please give feedback on how helpful they are/aren't.

 Regards,

 Eric Gilmore

 On Wed, Feb 23, 2011 at 11:15 AM, jeremy.truel...@barclayscapital.com
 wrote:

 What's the best way to bring multiple seeds up, should only one of them
 have
 auto bootstrap set to true or should neither of them? Should they list
 themselves and the other seed in their seed section in the yaml config?




 If a node is defined as a seeds it will never auto bootstrap. After it
 has bootstrapped and has a system table you can set its yaml file as a
 seed if you wish.




Re: Multiple Seeds

2011-02-23 Thread Edward Capriolo
On Wed, Feb 23, 2011 at 3:28 PM,  jeremy.truel...@barclayscapital.com wrote:
 So does cassandra monitor the config file for changes? If it doesn't how else 
 would it know unless you restart you had added a new seed?

 -Original Message-
 From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
 Sent: Wednesday, February 23, 2011 3:23 PM
 To: user@cassandra.apache.org
 Cc: Truelove, Jeremy: IT (NYK)
 Subject: Re: Multiple Seeds

 On Wed, Feb 23, 2011 at 2:59 PM,  jeremy.truel...@barclayscapital.com wrote:
 To add a host to the seeds list after it has had the data streamed to it I
 need to



 1.   stop it

 2.   edit the yaml file to

 a.   include it in the seeds list

 b.  set auto boostrap to false

 3.    restart it



 correct? Additionally you would need to add it to the other nodes seed lists
 and restart them as well.



 From: Eric Gilmore [mailto:e...@datastax.com]
 Sent: Wednesday, February 23, 2011 2:47 PM
 To: user@cassandra.apache.org
 Subject: Re: Multiple Seeds



 Well -- when you first bring a node into a ring, you will probably want to
 stream data to it with auto_bootstrap: true.

 If you want that node to be a seed, then add it to the seeds list AFTER it
 has joined the ring.

 I'd refer you to the Seed List and Autoboostrapping sections of the
 Getting Started guide, which contain the following blurbs:

 There is no strict rule to determine which hosts need to be listed as seeds,
 but all nodes in a cluster need the same seed list. For a production
 deployment, DataStax recommends two seeds per data center.

 An autobootstrapping node cannot have itself in the list of seeds nor can it
 contain an initial_token already claimed by another node. To add new seeds,
 autobootstrap the nodes first, and then configure them as seeds.







 On Wed, Feb 23, 2011 at 11:39 AM, jeremy.truel...@barclayscapital.com
 wrote:

 So all seeds should always be set to 'auto_bootstrap: false' in their .yaml
 file.

 -Original Message-
 From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
 Sent: Wednesday, February 23, 2011 2:36 PM
 To: user@cassandra.apache.org

 Cc: Truelove, Jeremy: IT (NYK)
 Subject: Re: Multiple Seeds

 On Wed, Feb 23, 2011 at 2:30 PM,  jeremy.truel...@barclayscapital.com
 wrote:
 Yeah I set the tokens, I'm more asking if I start the first seed node with
 autobootstrap set to false the second seed should have it set to true as
 well as all the slave nodes correct? I didn't see this in the docs but I
 may
 have just missed it.



 From: Eric Gilmore [mailto:e...@datastax.com]
 Sent: Wednesday, February 23, 2011 2:24 PM
 To: user@cassandra.apache.org
 Subject: Re: Multiple Seeds



 The DataStax documentation offers some answers to those questions in the
 Getting Started section and the Clustering reference docs.

 Autobootstrap should be true, but with the important caveat that
 intial_token values should be specified.  Have a look at those docs, and
 please give feedback on how helpful they are/aren't.

 Regards,

 Eric Gilmore

 On Wed, Feb 23, 2011 at 11:15 AM, jeremy.truel...@barclayscapital.com
 wrote:

 What's the best way to bring multiple seeds up, should only one of them
 have
 auto bootstrap set to true or should neither of them? Should they list
 themselves and the other seed in their seed section in the yaml config?




 If a node is defined as a seeds it will never auto bootstrap. After it
 has bootstrapped and has a system table you can set its yaml file as a
 seed if you wish.




Re: Will the large datafile size affect the performance?

2011-02-23 Thread Edward Capriolo
On Wed, Feb 23, 2011 at 4:51 PM, buddhasystem potek...@bnl.gov wrote:

 I know that theoretically it should not (apart from compaction issues), but
 maybe somebody has experience showing otherwise:

 My test cluster now has 250GB of data and will have 1.5TB in its
 reincarnation. If all these data is in a single CF -- will it cause read or
 write performance problems? Should I shard it? One advantage of splitting
 the data would be reducing the impact of compaction and repairs (or so I
 naively assume).

 TIA

 Maxim

 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Will-the-large-datafile-size-affect-the-performance-tp6057991p6057991.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.


http://wiki.apache.org/cassandra/LargeDataSetConsiderations

By dividing your data you get the benefit of being able to apply
different settings at the column family or keyspace level. For example,
you might have some batch data that you only want to replicate twice,
or some small subset of data that needs to be read frequently and is
heavily cached. Also, as you said, having a few smaller CFs helps you
avoid a single very long-running and intensive operation like repair
or major compaction.

If you always need to read both CFs to satisfy your application, it is
not a good idea.


Re: New Chain for : Does Cassandra use vector clocks

2011-02-23 Thread Edward Capriolo
On Wed, Feb 23, 2011 at 9:28 PM, Ritesh Tijoriwala
tijoriwala.rit...@gmail.com wrote:
 I was about to ask what Anthony's latest post below captures - if we don't
 have vector clocks and no locking, how does cassandra prevent/detect
 conflicts? This is somewhat related to the question I asked in last post
 - http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-td6055152.html
 Thanks,
 Ritesh



 On Wed, Feb 23, 2011 at 6:22 PM, Anthony John chirayit...@gmail.com wrote:

 Apologies : For some reason my response on the original mail keeps
 bouncing back, thus this new one!

  From the other hand, the same article says:
  For conditional writes to work, the condition must be evaluated at all
  update
  sites before the write can be allowed to succeed.
 
  This means, that when doing such an update CL=ALL must be used

 Sorry, but I am confused by that entire thread!
 Questions:-
 1. Does Cassandra implement any kind of data locking - at any granularity
 whether it be row/colF/Col ?
 2. If the answer to 1 above is NO! - how does CL ALL prevent conflicts.
 Concurrent updates on exactly the same piece of data on different nodes can
 still mess each other up, right ?
 -JA


Cassandra does not provide any built-in locking. It cannot protect
against lost updates caused by multiple independent entities reading
and writing the same data.

The Cages library handles locking externally and is really easy to use.
http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/


A simple script that creates multi node clusters on a single machine.

2011-02-23 Thread Edward Capriolo
On the mailing list and IRC there are many questions about Cassandra
internals. I understand where the questions are coming from because it
took me a while to get a grip on it.

However, if you have a laptop with a decent amount of RAM (2 GB is
enough for 3-5 nodes; 4 GB is better), you can kick up a multi-node
cluster right on your laptop. Then you can test failure/eventual
consistency scenarios, such as (insert to node A, kill node B, join node
C), to your heart's content.

http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/lauching_5_node_cassandra_clusters


Re: Fill disks more than 50%

2011-02-23 Thread Edward Capriolo
On Wed, Feb 23, 2011 at 9:39 PM, Terje Marthinussen
tmarthinus...@gmail.com wrote:
 Hi,
 Given that you have have always increasing key values (timestamps) and never
 delete and hardly ever overwrite data.
 If you want to minimize work on rebalancing and statically assign (new)
 token ranges to new nodes as you add them so they always get the latest
 data
 Lets say you add a new node each year to handle next years data.
 In a scenario like this, could you with 0.7 be able to safely fill disks
 significantly more than 50% and still manage things like repair/recovery of
 faulty nodes?

 Regards,
 Terje

All your data for a given day/month/year would sit on the same server,
meaning all your servers with old data would be idle and your servers
with current data would be very busy. This is probably not a good way
to go.

There is a ticket open for 0.8 for efficient node moves/joins. It is
already a lot better in 0.7. Pretend you did not see this (you can
join nodes using rsync if you know some tricks) if you are really
afraid of joins, which you really should not be.

As for the 50% statement: in the worst case a major compaction
will require double the disk size of your column family. So if you
have more than one column family you do NOT need 50% overhead.


Re: Fill disks more than 50%

2011-02-24 Thread Edward Capriolo
On Thu, Feb 24, 2011 at 4:08 AM, Thibaut Britz
thibaut.br...@trendiction.com wrote:
 Hi,

 How would you use rsync instead of repair in case of a node failure?

 Rsync all files from the data directories from the adjacant nodes
 (which are part of the quorum group) and then run a compactation which
 will? remove all the unneeded keys?

 Thanks,
 Thibaut


 On Thu, Feb 24, 2011 at 4:22 AM, Edward Capriolo edlinuxg...@gmail.com 
 wrote:
 On Wed, Feb 23, 2011 at 9:39 PM, Terje Marthinussen
 tmarthinus...@gmail.com wrote:
 Hi,
 Given that you have have always increasing key values (timestamps) and never
 delete and hardly ever overwrite data.
 If you want to minimize work on rebalancing and statically assign (new)
 token ranges to new nodes as you add them so they always get the latest
 data
 Lets say you add a new node each year to handle next years data.
 In a scenario like this, could you with 0.7 be able to safely fill disks
 significantly more than 50% and still manage things like repair/recovery of
 faulty nodes?

 Regards,
 Terje

 Since all your data for a day/month/year would sit on the same server.
 Meaning all your servers with old data would be idle and your servers
 with current data would be very busy. This is probably not a good way
 to go.

 There is a ticket open for 0.8 for efficient node moves joins. It is
 already a lot better in 0.7. Pretend you did not see this (you can
 join nodes using rsync if you know some tricks) if you are really
 afraid of joins, which you really should not be.

 As for the 50% statement. In a worse case scenario a major compaction
 will require double the disk size of your column family. So if you
 have more then 1 column family you do NOT need 50% overhead.


@Thibaut Britz
Caveat: using SimpleStrategy.
This works because Cassandra scans its data directories at startup and
then serves what it finds. For a join, for example, you can rsync all
the data from the node below/to the right of where the new node is
joining, then join without bootstrap, then run cleanup on both nodes.
(You also have to shut down the source node so you do not have a lost
write scenario in the window between the rsync and the new node's
startup.)
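A rough sketch of that rsync join trick, under the caveats above (hostnames
are placeholders and the data path assumes the default layout):

    # flush and stop the source node (the node to the right of the new token)
    nodetool -h source-node drain

    # copy its data files to the new node
    rsync -av /var/lib/cassandra/data/ new-node:/var/lib/cassandra/data/

    # start the new node with auto_bootstrap: false and its own initial_token,
    # then drop the data that no longer belongs on either node
    nodetool -h new-node cleanup
    nodetool -h source-node cleanup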

It does not make as much sense for repair, because the data on a node
will triple before you compact/clean it up.

@Terje
I am suggesting that you probably want to rethink your schema design,
since partitioning by year is going to perform badly: the old servers
will be nothing more than expensive tape drives.


Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread Edward Capriolo
On Thu, Feb 24, 2011 at 3:03 PM, A J s5a...@gmail.com wrote:
 yes, that is difficult to digest and one has to be sure if the use
 case can afford it.

 Some other NOSQL databases deals with it differently (though I don't
 think any of them use atomic 2-phase commit). MongoDB for example will
 ask you to read from the node you wrote first (primary node) unless
 you are ok with eventual consistency. If the write did not make to
 majority of other nodes, it will be rolled-back from the original
 primary when it comes up again as a secondary.
 In some cases, you still could server either new value (that was
 returned as failed) or the old one. But it is different from Cassandra
 in the sense that Cassandra will never rollback.



 On Thu, Feb 24, 2011 at 2:47 PM, Anthony John chirayit...@gmail.com wrote:
 The leap of faith here is that an error does not mean a clean backing out to
 prior state - as we are used to with databases. It means that the operation
 in error could have gone through partially

 Again, this is not an absolutely unfamiliar territory and can be dealt with.
 -JA
 On Thu, Feb 24, 2011 at 1:16 PM, A J s5a...@gmail.com wrote:

 but could be broken in case of a failed write
 You can think of a scenario where R + W N still leads to
 inconsistency even for successful writes. Say you keep W=1 and R=N .
 Lets say the one node where a write happened with success goes down
 before it made to the other N-1 nodes. Lets say it goes down for good
 and is unrecoverable. The only option is to build a new node from
 scratch from other active nodes. This will lead to a write that was
 lost and you will end up serving stale copy of it.

 It is better to talk in terms of use cases and if cassandra will be a
 fit for it. Otherwise unless you have W=R=N and fsync before each
 write commit, there will be scope for inconsistency.


 On Thu, Feb 24, 2011 at 1:25 PM, Anthony John chirayit...@gmail.com
 wrote:
  I see the point - apologies for putting everyone through this!
  It was just militating against my mental model.
  In summary, here is my take away - simple stuff but - IMO - important to
  conclude this thread (I hope):-
  1. I was splitting hair over a failed ( partial ) Q Write. Such an event
  should be immediately followed by the same write going to a connection
  on to
  another node ( potentially using connection caches of client
  implementations
  ) or a Read at CL of All. Because a write could have partially gone
  through.
  2. Timestamps are used in determining the latest version ( correcting
  the
  false impression I was propagating)
   Finally, wrt W + R > N for Q CL statement holds, but could be broken
  in
  case of a failed write as it is unsure whether the new value got written
  on
   any server or not. Is that a fair characterization ?
  Bottom line - unlike traditional DBMS, errors do not ensure automatic
  cleanup and revert back, app code has to follow up if  immediate - and
  not
  eventual -  consistency is desired. I made that leap in almost all cases
  - I
  think - but the case of a failed write.
  My bad and I can live with this!
  Regards,
  -JA
 
  On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
  sylv...@datastax.com
  wrote:
 
  On Thu, Feb 24, 2011 at 6:33 PM, Anthony John chirayit...@gmail.com
  wrote:
 
  Completely understand!
  All that I am quibbling over is whether a CL of quorum guarantees
  consistency or not. That is what the documentation says - right. IF
  for a CL
  of Q read - it depends on which node returns read first to determine
  the
  actual returned result or other more convoluted conditions , then a
  Quorum
  read/write is not consistent, by any definition.
 
  But that's the point. The definition of consistency we are talking
  about
  has no meaning if you consider only a quorum read. The definition
  (which is
  the de facto definition of consistency in 'eventually consistent') make
  sense if we talk about a write followed by a read. And it is
  considering succeeding write followed by succeeding read.
  And that is the statement the wiki is making.
  Honestly, we could debate forever on the definition of consistency and
  whatnot. Cassandra guaranties that if you do a (succeeding) write on W
   replica and then a (succeeding) read on R replica and if R+W > N, then it
  is
  guaranteed that the read will see the preceding write. And this is what
  is
  called consistency in the context of eventual consistency (which is not
  the
  context of ACID).
  If this is not the definition of consistency you had in mind then by
  all
  mean, Cassandra probably don't guarantee this definition. But given
  that the
  paragraph preceding what you pasted state clearly we are not talking
  about
  ACID consistency, but eventual consistency, I don't think the wiki is
  making
  any unfair statement.
  That being said, the wiki may not be always as clear as it could. But
  it's
  an editable wiki :)
  --
  Sylvain
 
 
  I can still use Cassandra, and will use it, luv 

Re: Understanding Indexes

2011-02-24 Thread Edward Capriolo
On Thu, Feb 24, 2011 at 3:34 PM, mcasandra mohitanch...@gmail.com wrote:

 I wasn't aware that there is an index on primary key (that is row keys). So
 from what I understand there is by default an index on for eg: , in
 below example? Where can I read more about it?

 UserProfile = { // this is a ColumnFamily
     {   // this is the key to this Row inside the CF
        // now we have an infinite # of columns in this row
        username: phatduckk,
        email: [hidden email],
        phone: (900) 976-
    }, // end row
     {   // this is the key to another row in the CF
        // now we have another infinite # of columns in this row
        username: ieure,
        email: [hidden email],
        phone: (888) 555-1212
        age: 66,
        gender: undecided
    },
  }


 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061857.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.



Dude! You are running before you can walk. Why are you worried about
secondary indexing before you know what the primary index is? :)

http://wiki.apache.org/cassandra/ArchitectureOverview
http://wiki.apache.org/cassandra/ArchitectureSSTable


Re: Understanding Indexes

2011-02-24 Thread Edward Capriolo
On Thu, Feb 24, 2011 at 3:55 PM, mcasandra mohitanch...@gmail.com wrote:

 Either I am not explaning properly or I don't understand the data model just
 yet. Please check again:

 In below example this is what I understand:

 1) UserProfile is a CF
 2)  is a row key
 3) username is a column. Each row (eg ) has username column

 My understanding is that secondary indexes can be created only on column
 value. Which means I can create secondary index only on username, email etc.
 not on .  is the row key, but you keep saying that I need secondary
 index, but I am actually asking about index on the row key.

 Is my understanding incorrect about this?

 UserProfile = { // this is a ColumnFamily
     {   // this is the key to this Row inside the CF
        // now we have an infinite # of columns in this row
        username: phatduckk,
        email: [hidden email],
        phone: (900) 976-
    }, // end row
     {   // this is the key to another row in the CF
        // now we have another infinite # of columns in this row
        username: ieure,
        email: [hidden email],
        phone: (888) 555-1212
        age: 66,
        gender: undecided
    },
  }

 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061959.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.


You do not need secondary indexes to search on the RowKey. The Row Key
is used by the partitioner to locate your data across the cluster. The
Row Key is also used as the primary sort of the SSTables. Thus the row
key is naturally indexed.


Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread Edward Capriolo
On Thu, Feb 24, 2011 at 3:56 PM, A J s5a...@gmail.com wrote:
 While we are at it, there's more to consider than just CAP in distributed :)
 http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors

 On Thu, Feb 24, 2011 at 3:31 PM, Edward Capriolo edlinuxg...@gmail.com 
 wrote:
 On Thu, Feb 24, 2011 at 3:03 PM, A J s5a...@gmail.com wrote:
 yes, that is difficult to digest and one has to be sure if the use
 case can afford it.

 Some other NOSQL databases deals with it differently (though I don't
 think any of them use atomic 2-phase commit). MongoDB for example will
 ask you to read from the node you wrote first (primary node) unless
 you are ok with eventual consistency. If the write did not make to
 majority of other nodes, it will be rolled-back from the original
 primary when it comes up again as a secondary.
 In some cases, you still could server either new value (that was
 returned as failed) or the old one. But it is different from Cassandra
 in the sense that Cassandra will never rollback.



 On Thu, Feb 24, 2011 at 2:47 PM, Anthony John chirayit...@gmail.com wrote:
 The leap of faith here is that an error does not mean a clean backing out 
 to
 prior state - as we are used to with databases. It means that the operation
 in error could have gone through partially

 Again, this is not an absolutely unfamiliar territory and can be dealt 
 with.
 -JA
 On Thu, Feb 24, 2011 at 1:16 PM, A J s5a...@gmail.com wrote:

 but could be broken in case of a failed write
 You can think of a scenario where R + W > N still leads to
 inconsistency even for successful writes. Say you keep W=1 and R=N .
 Lets say the one node where a write happened with success goes down
 before it made to the other N-1 nodes. Lets say it goes down for good
 and is unrecoverable. The only option is to build a new node from
 scratch from other active nodes. This will lead to a write that was
 lost and you will end up serving stale copy of it.

 It is better to talk in terms of use cases and if cassandra will be a
 fit for it. Otherwise unless you have W=R=N and fsync before each
 write commit, there will be scope for inconsistency.


 On Thu, Feb 24, 2011 at 1:25 PM, Anthony John chirayit...@gmail.com
 wrote:
  I see the point - apologies for putting everyone through this!
  It was just militating against my mental model.
  In summary, here is my take away - simple stuff but - IMO - important to
  conclude this thread (I hope):-
  1. I was splitting hair over a failed ( partial ) Q Write. Such an event
  should be immediately followed by the same write going to a connection
  on to
  another node ( potentially using connection caches of client
  implementations
  ) or a Read at CL of All. Because a write could have partially gone
  through.
  2. Timestamps are used in determining the latest version ( correcting
  the
  false impression I was propagating)
   Finally, wrt W + R > N for Q CL statement holds, but could be broken
  in
  case of a failed write as it is unsure whether the new value got written
  on
   any server or not. Is that a fair characterization ?
  Bottom line - unlike traditional DBMS, errors do not ensure automatic
  cleanup and revert back, app code has to follow up if  immediate - and
  not
  eventual -  consistency is desired. I made that leap in almost all cases
  - I
  think - but the case of a failed write.
  My bad and I can live with this!
  Regards,
  -JA
 
  On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
  sylv...@datastax.com
  wrote:
 
  On Thu, Feb 24, 2011 at 6:33 PM, Anthony John chirayit...@gmail.com
  wrote:
 
  Completely understand!
  All that I am quibbling over is whether a CL of quorum guarantees
  consistency or not. That is what the documentation says - right. IF
  for a CL
  of Q read - it depends on which node returns read first to determine
  the
  actual returned result or other more convoluted conditions , then a
  Quorum
  read/write is not consistent, by any definition.
 
  But that's the point. The definition of consistency we are talking
  about
  has no meaning if you consider only a quorum read. The definition
  (which is
  the de facto definition of consistency in 'eventually consistent') make
  sense if we talk about a write followed by a read. And it is
  considering succeeding write followed by succeeding read.
  And that is the statement the wiki is making.
  Honestly, we could debate forever on the definition of consistency and
  whatnot. Cassandra guaranties that if you do a (succeeding) write on W
   replica and then a (succeeding) read on R replica and if R+W > N, then it
  is
  guaranteed that the read will see the preceding write. And this is what
  is
  called consistency in the context of eventual consistency (which is not
  the
  context of ACID).
  If this is not the definition of consistency you had in mind then by
  all
  mean, Cassandra probably don't guarantee this definition. But given
  that the
  paragraph preceding what you pasted state clearly

Re: Fill disks more than 50%

2011-02-25 Thread Edward Capriolo
On Fri, Feb 25, 2011 at 7:38 AM, Terje Marthinussen
tmarthinus...@gmail.com wrote:

 @Thibaut Britz
 Caveat:Using simple strategy.
 This works because cassandra scans data at startup and then serves
 what it finds. For a join for example you can rsync all the data from
 the node below/to the right of where the new node is joining. Then
 join without bootstrap then cleanup both nodes. (also you have to
 shutdown the first node so you do not have a lost write scenario in
 the time between rsync and new node startup)


 rsync all data from node to left/right..
 Wouldn't that mean that you need 2x the data to recover...?
 Terje

Terje,

In your scenario, where you are never updating, running repair becomes
less important. I have an alternative for you. I have a program I call
the RescueRanger; we use it to range-scan all our data, find old
entries and then delete them. However, if we set that program to
read-only mode and tell it to read at CL.ALL, it becomes a program that
read-repairs data!

This is a tradeoff. Range scanning through all your data is not fast,
but it does not require the extra disk space. Kind of like merge sort vs
bubble sort.


Re: Storing photos, images, docs etc.

2011-03-01 Thread Edward Capriolo
On Tue, Mar 1, 2011 at 1:43 PM, mcasandra mohitanch...@gmail.com wrote:
 Is it advisable or ok to store photos, images and docs in cassandra where you
 expect high volume of uploads and views?

 I was reading about facebook implementation of haystack to store the photos.
 They don't put anything in their mysql db.

 Since Cassandra is different from mysql I was wondering if it's ok or if
 there are going to be any issues.

 I tried searching online to read articles or papers on similar subject but
 couldn't find any where cassandra was being used to store docs/images etc.

 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Storing-photos-images-docs-etc-tp6078278p6078278.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.


Googling the terms cassandra large files + feeling lucky:
http://www.google.com/search?q=cassandra+large+files&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a

Yields:
http://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage

This is also nearly a bi-monthly mailing list topic.


Re: Storing photos, images, docs etc.

2011-03-03 Thread Edward Capriolo
On Thu, Mar 3, 2011 at 2:49 PM, mcasandra mohitanch...@gmail.com wrote:
 Has anyone heard about lustre distributed file system? I am wondering if it
 will work well where keep the metadata in Cassandra and images in Lustre.

 I looked at MogileFS but not too sure about it's support.

 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Storing-photos-images-docs-etc-tp6078278p6086135.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.


Lustre and GlusterFS are cool, but this is apples and oranges. Those are
both mountable file systems with POSIX support. That is very different
from a key-value store.


Re: Poor performance on small data set

2011-03-11 Thread Edward Capriolo
On Fri, Mar 11, 2011 at 11:44 AM, Peter Schuller
peter.schul...@infidyne.com wrote:
 There is less than 1000 rows and i've got a 75-100ms to get one row by id
 With memcached it's 2ms

 I don't know where is the problem. jvm ? cassandra ? phpcassa ?

 What can i do to detect where is the problem ?

 I'm not familiar with the PHP client, but this sounds suspiciously
 like a nagle + delayed ACK problem. The PHP client probably isn't
 setting the TCP_NODELAY flag (or the equivalent in Windows).

 Google for nagle delayed ack for details.

 --
 / Peter Schuller

Also, you will find that setting both rowsCached and keysCached is not
effective. Choose one or the other. (That is not your problem, but an
FYI.)
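A sketch of picking one cache per column family, assuming the 0.7
cassandra-cli attribute names:

    update column family Users with rows_cached=10000 and keys_cached=0;

(or the reverse, caching only keys, if the rows are too large to cache whole)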


Re: Is column update column-atomic or row atomic?

2011-03-15 Thread Edward Capriolo
On Tue, Mar 15, 2011 at 5:46 PM, buddhasystem potek...@bnl.gov wrote:
 Sorry for the rather primitive question, but it's not clear to me if I need
 to fetch the whole row, add a column as a dictionary entry and re-insert it
 if I want to expand the row by one column. Help will be appreciated.


 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Is-column-update-column-atomic-or-row-atomic-tp6174445p6174445.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.


No. In Cassandra you do not need to read before you write: you can
insert a single new column into an existing row without fetching the
row first. You should try to avoid read-before-write if possible.
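For example, with the 0.7 CLI (the column family, key, and column names here
are hypothetical), adding one column is a single write with no prior read:

    set UserProfile['user123']['age'] = '66';

The new column is simply merged with the rest of the row at read and
compaction time.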


Re: Please help decipher /proc/cpuinfo for optimal Cassandra config

2011-03-16 Thread Edward Capriolo
On Wed, Mar 16, 2011 at 9:58 PM, buddhasystem potek...@bnl.gov wrote:
 Dear All,
 this is from my new Cassandra server. It obviously uses hyperthreading, I
 just don't know how to translate this to concurrent readers and writers in
 cassandra.yaml -- can somebody take a look and tell me what number of cores
 I need to assume for concurrent_reads and concurrent_writes. Is it 24?
 Thanks!

 [cassandra@cassandra01 bin]$ cat /proc/cpuinfo
 processor       : 0
 vendor_id       : GenuineIntel
 cpu family      : 6
 model           : 44
 model name      : Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
 stepping        : 2
 cpu MHz         : 1596.000
 cache size      : 12288 KB
 physical id     : 0
 siblings        : 12
 core id         : 0
 cpu cores       : 6
 apicid          : 0
 initial apicid  : 0
 fpu             : yes
 fpu_exception   : yes
 cpuid level     : 11
 wp              : yes
 flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov
 pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
 rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc
 aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16
 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm arat tpr_shadow vnmi
 flexpriority ept vpid
 bogomips        : 5333.91
 clflush size    : 64
 cache_alignment : 64
 address sizes   : 40 bits physical, 48 bits virtual
 power management:

 [remaining /proc/cpuinfo entries condensed: the following processors repeat
 the same Intel(R) Xeon(R) CPU X5650 @ 2.67GHz data (siblings: 12, cpu
 cores: 6), differing only in processor number, core id and apicid; the dump
 is truncated in the archive]

Re: Replacing a dead seed

2011-03-17 Thread Edward Capriolo
On Thu, Mar 17, 2011 at 9:09 AM, Jonathan Colby
jonathan.co...@gmail.com wrote:
 Hi -

 If a seed crashes (i.e., suddenly unavailable due to HW problem),   what is 
 the best way to replace the seed in the cluster?

 I've read that you should not bootstrap a seed.  Therefore I came up with 
 this procedure, but it seems pretty complicated.  any better ideas?

 1. update the seed list on all nodes, taking out the dead node  and restart 
 the nodes in the  cluster so the new seed list is updated
 2. then bootstrap the new (replacement ) node as a normal node  (not yet as a 
 seed)
 3. when bootstrapping is done, make the new node a seed.
 4. update the seed list again adding back the replacement seed (and rolling 
 restart the cluster as in step 1)


 That seems to me like a whole lot of work.  Surely there is a better way?

 Jon

It is true that seeds do not auto bootstrap, but in this case it does
not matter whether the other nodes believe this node is a seed. It only
matters what the joining node is configured to believe.

On the joining node, do not include its own hostname/IP in the seed
list, and it should auto-bootstrap normally.
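A sketch of the replacement node's cassandra.yaml along those lines
(addresses are placeholders):

    auto_bootstrap: true
    initial_token:        # set as appropriate for your ring
    seeds:
        - 10.0.0.2        # the surviving seed(s) only, never this node itself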


Re: Optimizing a few nodes to handle all client connections?

2011-03-19 Thread Edward Capriolo
On Fri, Mar 18, 2011 at 9:55 PM, Jason Harvey alie...@gmail.com wrote:
 Hola everyone,

 I have been considering making a few nodes only manage 1 token and
 entirely dedicating them to talking to clients. My reasoning behind
 this is I don't like the idea of a node having a dual-duty of handling
 data, and talking to all of the client stuff.

 Is there any merit to this thought?

 Cheers,
 Jason


Technically possible, but not recommended. Besides making this node a
single point of failure, you assuredly add more latency to every
request. Each request also has memory overhead; one node bearing the
sum of the overhead of all requests is not scalable. This node can also
become a bandwidth bottleneck.

One of the reasons to choose Cassandra is that it does NOT have a
master/queen node that all requests are proxied through.


Re: Working backwards from production to staging/dev

2011-03-26 Thread Edward Capriolo
On Fri, Mar 25, 2011 at 2:11 PM, ian douglas i...@armorgames.com wrote:
 On 03/25/2011 10:12 AM, Jonathan Ellis wrote:

 On Fri, Mar 25, 2011 at 11:59 AM, ian douglasi...@armorgames.com  wrote:

 (we're running v0.60)

 I don't know if you could hear that from where you are, but our whole
 office just yelled, WTF! :)

 Ah, that's what that noise was... And yeah, we know we're way behind. Our
 initial delay in upgrading was waiting for 0.7 to come out and then we
 learned we needed a whole new Thrift client for our PHP code base, and then
 we got busy on other things, but we're at a point where we have some time to
 take care of Cassandra and get it upgraded.

  Our planned path, now, is:

 (our nodes' tokens are numbered using the python code (0, 1/3 and 2/3 times
 2^127), and called node 1 through 3, respectively; our RF is set to 2 right
 now)

 1. remove node 1 from our software
 2. bring node 1 offline after a flush/repair/cleanup
 3. run a cleanup on node 2 and then on node 3 so they have a full copy of
 all data from the old node 1 and each other.
 4. bring up a new Large 64-bit instance, install 0.6.12, assign a Token
 value of 0 (node 1), RF:2, on a new gossip ring, and copy all data from the
 32-bit nodes 2 and 3 and run a repair/cleanup to remove any duplicated data
 5. remove node 3 from our software
 6. point our code to the new 64-bit node 1
 7. bring node 3 offline after a flush/repair/cleanup so node 2 has the last
 fresh copy of everything
 8. bring node 2 offline after a flush/repair/cleanup
 9. bring up another Large instance, get a copy of all data from our old node
 2, assign a Token value of (1/2 * 2^127), RF:2, on the new gossip ring, run
 a repair to remove duplicate data, and then a cleanup so it gets replicated
 data from the new node 1
 10. add the new node 2 to our software
 11. run a final cleanup on the new node 1 and then on node 2 to make sure
 all data is replicated evenly on both nodes

 ... at this point, we should have two 64-bit Large instances, with RF:2, on
 a new gossip ring, replacing three 32-bit systems, with minimal down time
 and no data loss (just a data delay between steps 6 and 10 above).

 Questions:
 1. Does it appear that we've missed any steps, or doing something out of
 order?
 2. Is the flush/repair/cleanup overkill when bringing the old nodes offline,
 or is that the correct sequence to follow?
 3. Will the difference in compute units (lower on Large instances than
 Medium instances) make any noticeable difference, or will the fact that the
 machine is 64-bit handle things efficiently enough such that a Large
 instance works harder than a Medium instance? (never did figure out their
 how their compute units work)
 4. Can we follow similar steps when we're ready to upgrade to 0.7x and have
 our new Thrift client for PHP all squared away?


 Thanks again for the help!!!



If you have a node with an old column family that you are not using
anymore: stop the node, delete that column family's data files, and start
the node again.

Edward


Re: Starter GUI Tool for Windows

2011-03-26 Thread Edward Capriolo
I don't know. The Apache web server is "a patchy web server", but with
Crapssandra there is just no way to put that in a good light.

On Friday, March 25, 2011, Dario Bravo darbr...@gmail.com wrote:
 People: Crapssandra.
 I'm starting a Cassandra project and starting to learn about this beautiful 
 Cassandra, so I thougth that it would be nice to have a db gui tool under my 
 current OS.
 It doesn't do anything other than showing some info about the server or the 
 selected keyspace... but I hope it'll do many things such as manage 
 keyspaces, column families, columns and super columns, show data contained on 
 columns, allow to perform queries (get, set, mostly), etc.


 If anyone wishes to help in any way, please feel free to download the code 
 and modify it.
 It's called Crapssandra because it started as a crappy simple code and it's 
 features are gonna be developed as I need them... so it will have crappy 
 code, mostly.


 It's done using .net 3.5 and Thrift.
 The address to download it and it's source code 
 is: http://code.google.com/p/crapssandra/


  http://code.google.com/p/crapssandra/Hope this helps someone, that the app 
 grow as I wish, and to get some help from the community.
 Thanks!

 --
 Darío Bravo






Re: Starter GUI Tool for Windows

2011-03-27 Thread Edward Capriolo
On Sun, Mar 27, 2011 at 10:56 AM, Dario Bravo darbr...@gmail.com wrote:
 I'm adding new features today. You can now download it and will be able to
 view keyspaces info and column families.
 I will start to develop a feature to add column families to keyspaces... it
 will take some time, but you can play around with it (for almost a minute,
 before you get bored).



 2011/3/26 Dario Bravo darbr...@gmail.com

 hehe, okay, maybe I'd chosen a bad name... does anybody think a better
 one?
 If you check out the source, it can do a few new things, such as drop
 keyspaces (except system), and show info on selected nodes...
 Tomorrow I'll be adding a bunch of new features, I hope.

 2011/3/26 Edward Capriolo edlinuxg...@gmail.com

 I don't know. Apache web server is a patchy web server, but crapsandra
 just no way to put that in a good light.

 On Friday, March 25, 2011, Dario Bravo darbr...@gmail.com wrote:
  People: Crapssandra.
  I'm starting a Cassandra project and starting to learn about this
  beautiful Cassandra, so I thougth that it would be nice to have a db gui
  tool under my current OS.
  It doesn't do anything other than showing some info about the server or
  the selected keyspace... but I hope it'll do many things such as manage
  keyspaces, column families, columns and super columns, show data contained
  on columns, allow to perform queries (get, set, mostly), etc.
 
 
  If anyone wishes to help in any way, please feel free to download the
  code and modify it.
  It's called Crapssandra because it started as a crappy simple code and
  it's features are gonna be developed as I need them... so it will have
  crappy code, mostly.
 
 
  It's done using .net 3.5 and Thrift.
  The address to download it and it's source code
  is: http://code.google.com/p/crapssandra/
 
 
   http://code.google.com/p/crapssandra/Hope this helps someone, that
  the app grow as I wish, and to get some help from the community.
  Thanks!
 
  --
  Darío Bravo
 
 
 
 



 --
 Darío Bravo





 --
 Darío Bravo




There is a client-dev@ list that is perfect for these threads.


Re: International language implementations

2011-03-29 Thread Edward Capriolo
On Tue, Mar 29, 2011 at 5:54 PM, A J s5a...@gmail.com wrote:
 Example, taobao.com is a chinese online bid site. All data is chinese
 and they use Mongodb successfully.
 Are there similar installations of cassandra where data is non-latin ?

 I know in theory, it should all work as cassandra has full utf-8
 support. But unless there are real implementations, you cannot be sure
 of the issues related to storing,sorting etc..

 Regards.


 On Tue, Mar 29, 2011 at 5:41 PM, Peter Schuller
 peter.schul...@infidyne.com wrote:
 Can someone list some of the current international language
 implementations of cassandra ?

 What is an international language implementation of Cassandra?

 --
 / Peter Schuller


Keyspace       - Java String
ColumnFamily   - Java String
Row key        - byte[]
Column name    - byte[]
Value          - byte[]

So you can encode/store any type of data you like.

As for internationalization, I have not found any NadaSQL groups yet.
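
A minimal sketch, assuming the pycassa client (keyspace/CF names are
illustrative): since everything below the keyspace/CF level is bytes on the
server, storing non-Latin data is just a matter of encoding consistently.

  # -*- coding: utf-8 -*-
  import pycassa

  pool = pycassa.ConnectionPool('Keyspace1', server_list=['localhost:9160'])
  cf = pycassa.ColumnFamily(pool, 'Standard1')

  title = u'\u4e2d\u6587\u6807\u9898'          # a Chinese title
  cf.insert('story42', {'title': title.encode('utf-8')})

  stored = cf.get('story42')['title'].decode('utf-8')
  assert stored == title

As for sorting, column names under the UTF8Type comparator are ordered by
their UTF-8 bytes, which matches code-point order rather than any
locale-aware collation.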


Re: How to determine if repair need to be run

2011-03-30 Thread Edward Capriolo
On Wed, Mar 30, 2011 at 12:54 PM, Peter Schuller
peter.schul...@infidyne.com wrote:
 Note this script doesn't work if your repair takes hours, and in the
 middle of the repair cassandra was restarted, nodetool will exit and the
 flagfile will be updated.   Another case, if repair hangs, and day later
 cassandra is restarted.

 This is why set -e is at the top and commented as important :) But
 it relies on 'nodetool repair' reliably exiting with non-zero exit
 status on failures.

 if nodetool returns an error this might work:

  nodetool -h localhost repair && touch /path/to/flagfile.tmp

 That's the equivalent, due to 'set -e'.


 --
 / Peter Schuller


I just wanted to chime in here and say that some people NEVER run repair.
In our particular case we remove inactive data older than a specific
date. If we lost a tombstone and that data were to re-appear, it
would really not be the end of the world for us. Repair is quite
intensive, since it involves a compaction, and in 0.6.x it was not optimal
because it really increased the on-disk data size. I have followed some
threads, and there are some conditions that I have read repair cannot
handle. The question you have to ask yourself is how likely they are to
occur and what they would mean in your use case. These are not easy
questions to answer.
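
For anyone who does run repair on a schedule, a rough sketch of the
flag-file idea from earlier in the thread, in Python rather than shell;
it assumes nodetool is on the PATH and exits non-zero on failure, and the
flag path is hypothetical.

  import subprocess
  import sys
  import time

  FLAG = '/var/run/cassandra/last-successful-repair'

  rc = subprocess.call(['nodetool', '-h', 'localhost', 'repair'])
  if rc == 0:
      # Record the completion time only when repair finished cleanly.
      open(FLAG, 'w').write('%d\n' % int(time.time()))
  else:
      sys.exit(rc)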


Re: Two column families or One super column family?

2011-03-31 Thread Edward Capriolo
On Thu, Mar 31, 2011 at 3:52 AM, T Akhayo t.akh...@gmail.com wrote:
 Hi Aaron,

 Thank you for your reply, i appreciate the suggestions you made.

 Yesterday i managed to get everything (our main read) in one CF, with the
 use of a structure in a value like you suggested.

 Designing a new data model is different from what i'm used to, but if you
 keep in mind that you designing for performance instead of flexibility then
 everything gets a bit easier.

 Kind regards,
 T. Akhayo

 2011/3/30 aaron morton aa...@thelastpickle.com

 I would go with the solution that means you only have to make one request
 to serve your reads, so consider the super CF approach.
 There are some downsides to super columns
 see http://wiki.apache.org/cassandra/CassandraLimitations and they tend to
 have a love-them-hate-them reputation.
 One thing to consider is that you do not need to model every attribute of
 your entity as a column in cassandra. Especially if you are always going to
 pull back all the attributes. So you could do your super CF approach with a
 standard CF, just pack the columns into some sort of structure such as JSON
 and store them as a blob.
 Or you can use a naming scheme in the column names with a standard CF,
 e.g. uuid1.text and uuid2.text
 Hope that helps.
 Aaron
 On 30 Mar 2011, at 01:05, T Akhayo wrote:

 Good afternoon,

 I'm making my data model from scratch for cassandra, this means i can tune
 and fine tune it for performance.

 At this time i'm having problems choosing between a 2 column families or 1
 super column family. I will illustrate with a example.

 Sector, this defines a place, this is one or two properties.
 Entry, a entry that is bound to a sector, this is simply some text and a
 few properties.

 I can model this with a super column family:

 sectors{ //super column family
 sector1{
 uid1{
 text: a text
 user: joop
 }
 uid2{
 text: more text
 user: piet
 }
 }
 sector2{
 uid10{
 text: even more text
 user: marie
 }
 }
 }

 But i can also model this with 2 column families:

 sectors{ // column family
 sector1{
 textid1: null
 textid2: null
 }
 sector2{
 textid4: null
 }
 }

 texts{ //column family
 textid1{
 text: a text
 user: joop
 }
 textid2{
 text: more text
 user: piet
 }
 }

 With the super column family i can retrieve a list of texts for a specific
 sector with only 1 request to cassandra.

 With the 2 column families i need to send 2 requests to cassandra:
 1. give me all textids from sector x. (returns x, y, z)
 2. give me all texts that have id x, y, z.

 In my final application it is likely that there will be a bit more writes
 compared to reads.

 I was wondering what the best approach is when it comes to performance. I
 suspect that using super column families is slower compared the using column
 families, but is it stil slower when using 2 column families and with 2
 request to cassandra instead of 1 (with super column family).

 Kind regards,
 T. Akhayo




I decided to write this as a general guide to the topic of
denormalizing things into multiple CF's or not.
 
http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/whytf_would_i_need_with
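
For reference, a minimal sketch of the two-column-family read path from the
thread above, assuming the pycassa client (keyspace/CF names follow the
example): two round trips, but the second one is a single multiget.

  import pycassa

  pool = pycassa.ConnectionPool('Keyspace1', server_list=['localhost:9160'])
  sectors = pycassa.ColumnFamily(pool, 'sectors')
  texts = pycassa.ColumnFamily(pool, 'texts')

  # Request 1: the column names under the sector row are the text ids.
  text_ids = sectors.get('sector1').keys()

  # Request 2: pull all of the referenced text rows in one call.
  for text_id, columns in texts.multiget(text_ids).items():
      print text_id, columns['text'], columns['user']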


Re: Not able to set ZERO consistency level

2011-03-31 Thread Edward Capriolo
On Thu, Mar 31, 2011 at 2:53 PM, Peter Schuller
peter.schul...@infidyne.com wrote:
 Only the following Levels are provided, I am wondering if the ZERO
 consistency level is removed in Cassandra 0.7.X ?

 Yes, it's gone.

 If so, Could you please explain why was it removed and what is the best
 option I have given my context.

 https://issues.apache.org/jira/browse/CASSANDRA-1607

 Are you *sure* you want it? :)

 --
 / Peter Schuller


ANY would be the next step up. Beware though of the eventually
consistent boogie man!
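
A minimal sketch, assuming the pycassa client: with ZERO gone, ANY is the
loosest write level left (the write only has to reach some live node,
possibly only as a hint).

  import pycassa
  from pycassa.cassandra.ttypes import ConsistencyLevel

  pool = pycassa.ConnectionPool('Keyspace1', server_list=['localhost:9160'])
  cf = pycassa.ColumnFamily(pool, 'Standard1',
                            write_consistency_level=ConsistencyLevel.ANY)
  cf.insert('some-key', {'col': 'val'})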


Re: Node added, no performance boost -- are the tokens correct?

2011-03-31 Thread Edward Capriolo
On Thu, Mar 31, 2011 at 6:15 PM, Eric Gilmore e...@datastax.com wrote:
 A script that I have says the following:

 $ python ctokens.py
 How many nodes are in your cluster? 2
 node 0: 0
 node 1: 85070591730234615865843651857942052864

 The first token should be zero, for the reasons discussed here:
 http://www.datastax.com/dev/tutorials/getting_started_0_7/configuring#initial-token-values

 More details are available in
 http://www.datastax.com/docs/0.7/operations/clustering#adding-capacity

 The DS docs have some weak areas, but these two pages have been pretty well
 vetted over the past months :)



 On Thu, Mar 31, 2011 at 3:06 PM, buddhasystem potek...@bnl.gov wrote:

 I just configured a cluster of two nodes -- do these token values make
 sense?
 The reason I'm asking that so far I don't see load balancing to be
 happening, judging from performance.

 Address         Status State   Load            Owns    Token

 170141183460469231731687303715884105728
 130.199.185.194 Up     Normal  153.52 GB       50.00%
 85070591730234615865843651857942052864
 130.199.185.193 Up     Normal  199.82 GB       50.00%
 170141183460469231731687303715884105728


 --
 View this message in context:
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Node-added-no-performance-boost-are-the-tokens-correct-tp6228872p6228872.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at
 Nabble.com.



The first token does not really have to be zero. The tokens just have to be
spread evenly across the token space.
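
A small sketch in the spirit of the ctokens.py output quoted above; the
offset argument is only there to show that the ring can start anywhere, as
long as the spacing stays even.

  def evenly_spaced_tokens(n, offset=0):
      # RandomPartitioner token space runs from 0 up to 2**127
      return [(offset + i * (2 ** 127 // n)) % (2 ** 127) for i in range(n)]

  for i, token in enumerate(evenly_spaced_tokens(2)):
      print 'node %d: %d' % (i, token)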


Re: Ditching Cassandra

2011-03-31 Thread Edward Capriolo
Gregori,
Congrats on winning the FUD-liest-post-of-the-month award. Firstly, if
you don't like updates, give up on computers and software. Especially
give up on anything that has to do with NoSQL, because it is fast
evolving.

If you think you have a problem with the Cassandra API, then what you
really have a problem with is the data model. You should have done more
research nine months ago.

I cannot understand from the rant exactly what you think is better about
the Mongo API. I see the complaint "lots of code"; I suggest books on
design patterns.

It is hardly the fault of Cassandra that it works with so many
languages and that people create higher-level clients and abstractions
for it.

I believe it is a testament to Cassandra that many places that are
historically non-Java shops can pick up Ruby or PHP clients and dive
in.

Also, I do not see exactly what is so hard about the API that Thrift
generates. To me it looks like the memcache API, except the value is a
map. I do not see what needs to be wrapped around it to make it
easier... maybe a factory method to one-liner things?


On Wednesday, March 30, 2011, Ashlee Saunders
ashlee.saund...@aswebco.com.au wrote:
 Thanks for the feedback Grgori,
 We in Australia are only concerned with solutions as we are a solutions 
 focused organization. With respect to your feedback, you and your team seem 
 to have identified no solutions other than jumping ship. When we subscribed 
 to the 50 or so emails per day, we wanted to contribute solutions to the 
 Cassandra community rather than dwell on problems.

 I have enjoyed following the team on this project, and they have been very 
 solutions focused. Please refrain from contributing negatively. Find 
 solutions to the Cassandra project.

 To the rest, please keep up the great work.

 Ashlee Saunders

 On 31/03/2011, at 7:19 AM, Ed Anuff e...@anuff.com wrote:

 My concern when I see something like this is it might cause developers
 on the project to get worried and start to try to solve the wrong
 problems.  Cassandra is not going to be as easy as Mongo, certainly
 not any time soon.  CQL won't do it, although it will help.  This
 isn't a criticism of Cassandra or CQL though.  Cassandra isn't here to
 compete with Mongo on ease of use, it's here to compete on
 scalability.  Secondly, the client libraries are not a mess.  Some
 might be, some are not - Hector, which is the one I contribute to, is
 pretty good.  Client libraries aren't going away.  People are still
 building client libraries on top of SQL four decades later, we just
 call them ORM or middleware.  Cassandra's data model is by necessity
 somewhat complicated, and most of the client libraries are going to
 have to be more than wrappers around Thrift or easy ways to send CQL.
 There's where Hector is going, it has a lightweight JPA implementation
 and it's going to have a very robust implementation soon.  Honestly,
 the only criticism by the OP that should be taken to heart is
 stability.  Cassandra can be the hardest database in the world to use
 and still succeed, but it has to be rock solid at all levels of scale,
 and that has to be the focus in the near term.

 On Tue, Mar 29, 2011 at 5:11 PM, Gregori Schmidt grokd...@gmail.com wrote:
 hi,
 After using Cassandra during development for the past 8 months my team and I
 made the decision to switch from Cassandra to MongoDB this morning.  I
 thought I'd share some thoughts on why we did this and where Cassandra might
 benefit from improvement.



Re: nodetool cfstathistogram error

2011-03-31 Thread Edward Capriolo
On Thu, Mar 31, 2011 at 8:25 PM, mcasandra mohitanch...@gmail.com wrote:
 It looks like if I use system schema it fails. Is it because of
 LocalPartitioner?

 I ran with other keyspace and got following output.

 Offset SSTables Write Latency Read Latency Row Size Column Count
 1 0 0 0 0 0
 2 0 0 0 0 0
 179 0 0 0 320 320


 Can someone please help me understand the output in first 2 columns? Why are
 SSTables always 0?

 I am writing shell/awk scripts to parse this data and send it out to
 monitoring tool.

 So far I am planning to monitor output of netstat, tpstat and cfhistograms.
 Is there anything else I should monitor that might be helpful?

 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/nodetool-cfstathistogram-error-tp6228995p6229038.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.


The system schema does not work here, and it probably would not produce
any interesting output if it did.


Re: Node added, no performance boost -- are the tokens correct?

2011-04-01 Thread Edward Capriolo
On Fri, Apr 1, 2011 at 1:15 PM, Peter Schuller
peter.schul...@infidyne.com wrote:
 Now, I moved the tokens. I still observe that read latency deteriorated with
 3 machines vs original one. Replication factor is 1, Cassandra version 0.7.2
 (didn't have time to upgrade as I need results by this weekend).

 Read *latency* is fully expected to increase if you just add a node.
 *Throughput* should increase, unless you have a workload that manages
 to be more expensive on RPC than actual reads/writes.

 Latency would only be improved by additional nodes under some significant 
 load.

 How are you benchmarking? Are you concurrently submitting requests to
 all nodes at the same time? Try using stress.py from the Cassandra
 tree as a comparison.

 If you're sending one request at a time, there is no expectation at
 all of a performance improvement - just a decrease in performance.

 --
 / Peter Schuller


To be clear on this issue: it does not matter where the tokens start,
it only matters that they are equally spaced around the token space.

So for a four-node cluster your tokens should be either
1 * ((2^127) / 4) = 42535295865117307932921825928971026432
2 * ((2^127) / 4) = 85070591730234615865843651857942052864
3 * ((2^127) / 4) = 127605887595351923798765477786913079296
4 * ((2^127) / 4) = 170141183460469231731687303715884105728

or
0 * ((2^127) / 4) = 0
1 * ((2^127) / 4) = 42535295865117307932921825928971026432
2 * ((2^127) / 4) = 85070591730234615865843651857942052864
3 * ((2^127) / 4) = 127605887595351923798765477786913079296

If you move one you have to move the rest, because the distance between
170141183460469231731687303715884105728 and 0 is only 1.


Re: Bizarre side-effect of increasing read concurrency

2011-04-01 Thread Edward Capriolo
On Fri, Apr 1, 2011 at 11:27 PM, Jason Harvey alie...@gmail.com wrote:
 On further analysis, it looks like this behavior occurs when a node is
 simply restarted. Is that normal behavior? If mark-and-sweep becomes
 less and less effective over time, does that suggest an issue with GC,
 or an issue with memory use?

 On Apr 1, 8:21 pm, Jason Harvey alie...@gmail.com wrote:
 After increasing read concurrency from 8 to 64, GC mark-and-sweep was
 suddenly able to reclaim much more memory than it previously did.

 Previously, mark-and-sweep would run around 5.5GB, and would cut heap
 usage to 4GB. Now, it still runs at 5.5GB, but it shrinks all the way
 down to 2GB used. This behavior was consistent in every machine I
 increased read-concurrent on.

 Any thoughts on why this behavior changed? No other diagnostics
 appeared to correlate to the concurrency change, besides thread count.


Jason,

First, you do not need to restart to adjust the number of concurrent
readers. It can be done through JMX without a restart.

As for the memory: after a restart you may have drained your caches
and memtables, which explains why less memory is used.

Java also enjoys using all the memory you allocate, and the garbage
collector does not give it back unless it needs to.

Edward


Re: Endless minor compactions after heavy inserts

2011-04-03 Thread Edward Capriolo
On Sun, Apr 3, 2011 at 1:46 PM, Sheng Chen chensheng2...@gmail.com wrote:
 I think if i can keep a single sstable file in a proper size, the hot
 data/index files may be able to fit into memory at least in some occasions.

 In my use case, I want to use cassandra for storage of a large amount of log
 data.
 There will be multiple nodes, and each node has 10*2TB disks to hold as much
 data as possible, ideally 20TB (about 100 billion rows) in one node.
 Reading operations will be much less than writing. A reading latency within
 1 second is acceptable.

 Is it possible? Do you have advice on this design?
 Thank you.

 Sheng



 2011/4/3 aaron morton aa...@thelastpickle.com

 With only one data file your reads would use the least amount of IO to
 find the data.
 Most people have multiple nodes and probably fewer disks, so each node may
 have a TB or two of data. How much capacity do your 10 disks give ? Will you
 be running multiple nodes in production ?
 Aaron


 On 2 Apr 2011, at 12:45, Sheng Chen wrote:

 Thank you very much.
 The major compaction will merge everything into one big file., which would
 be very large.
 Is there any way to control the number or size of files created by major
 compaction?
 Or, is there a recommended number or size of files for cassandra to
 handle?
 Thanks. I see the trigger of my minor compaction is OperationsInMillions.
 It is a number of operations in total, which I thought was in a second.
 Cheers,
 Sheng

 2011/4/1 aaron morton aa...@thelastpickle.com

 If you are doing some sort of bulk load you can disable minor compactions
 by setting the min_compaction_threshold and max_compaction_threshold to 0 .
 Then once your insert is complete run a major compaction via nodetool before
 turning the minor compaction back on.

 You can also reduce the compaction threads priority, see
 compaction_thread_priority in the yaml file.

 The memtable will be flushed when either the MB or ops throughput is
 triggered. If you are seeing a lot of memtables smaller than the MB
 threshold then the ops threshold is probably been triggered. Look for a log
 message at INFO level starting with Enqueuing flush of Memtable that will
 tell you how many bytes and ops the memtable had when it was flushed. Trying
 increasing the ops threshold and see what happens.

 You're change in the compaction threshold may not have an an effect
 because the compaction process was already running.

 AFAIK the best way to get the best out of your 10 disks will be to use a
 dedicated mirror for the commit log and a  stripe set for the data.

 Hope that helps.
 Aaron

 On 1 Apr 2011, at 14:52, Sheng Chen wrote:

  I've got a single node of cassandra 0.7.4, and I used the java stress
  tool to insert about 100 million records.
  The inserts took about 6 hours (45k inserts/sec) but the following
  minor compactions last for 2 days and the pending compaction jobs are 
  still
  increasing.
 
  From jconsole I can read the MemtableThroughputInMB=1499,
  MemtableOperationsInMillions=7.0
  But in my data directory, I got hundreds of 438MB data files, which
  should be the cause of the minor compactions.
 
  I tried to set compaction threshold by nodetool, but it didn't seem to
  take effects (no change in pending compaction tasks).
  After restarting the node, my setting is lost.
 
  I want to distribute the read load in my disks (10 disks in xfs, LVM),
  so I don't want to do a major compaction.
  So, what can I do to keep the sstable file in a reasonable size, or to
  make the minor compactions faster?
 
  Thank you in advance.
  Sheng
 






Consider the implications of
http://wiki.apache.org/cassandra/LargeDataSetConsiderations


Re: Embedding Cassandra in Java code w/o using ports

2011-04-04 Thread Edward Capriolo
On Mon, Apr 4, 2011 at 8:29 AM, aaron morton aa...@thelastpickle.com wrote:
 I'm interested to know more about the problems using the CLI.

 Aaron.

 On 2 Apr 2011, at 15:07, Bob Futrelle wrote:

 Connecting via CLI to local host with a port number has never been 
 successful for me in Snow Leopard.  No amount of reading suggestions and 
 varying the approach has worked.  So I'm going to talk to Cassandra via its 
 API, from Java.

 But I noticed that in some code samples that call the API from Java, ports 
 are also in play.  In using Derby in Java I've never had to designate any 
 ports.  Is such a  strategy available with Cassandra?

  - Bob Futrelle
    Northeastern U.




I realize you do not want to open ports at all. One thing I do is
leverage the private loopback addresses that exist on each computer:
127.0.0.1, 127.0.0.2 through 127.255.255.254.
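
A quick sketch showing the idea: on Linux the whole 127.0.0.0/8 block is
usable as distinct local addresses, so several local instances can each
bind their own loopback IP without opening anything externally. The sketch
assumes nothing is already listening on these address/port pairs.

  import socket

  for addr in ('127.0.0.1', '127.0.0.2', '127.0.0.3'):
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      s.bind((addr, 9160))   # 9160 is the default Thrift port
      print 'bound', addr
      s.close()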


Re: selecting random columns ..

2011-04-08 Thread Edward Capriolo
On Fri, Apr 8, 2011 at 4:48 AM, Sasha Dolgy sdo...@gmail.com wrote:
 hi all,

 is there a way to select random columns from a key?

 --
 Sasha Dolgy
 sasha.do...@gmail.com


getRangeSlice with random column start key.
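
A minimal sketch of the idea, assuming the pycassa client (key and CF names
are illustrative): take a slice of one column starting at a randomly chosen
column name.

  import random
  import pycassa

  pool = pycassa.ConnectionPool('Keyspace1', server_list=['localhost:9160'])
  cf = pycassa.ColumnFamily(pool, 'Standard1')

  # Pick a random start in the column-name space (assumed ASCII here) and
  # take the first column at or after it.
  start = random.choice('abcdefghijklmnopqrstuvwxyz')
  print cf.get('some-key', column_start=start, column_count=1)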


Re: database design

2011-04-13 Thread Edward Capriolo
On Wed, Apr 13, 2011 at 10:39 AM, Jean-Yves LEBLEU jleb...@gmail.com wrote:
 Hi all,

 Just some thoughts and question I have about cassandra data modeling.

 If I understand well, cassandra is better on writing than on reading.
 So you have to think about your queries to design cassandra schema. We
 are doing incremental design, and already have our system in
 production and we have to develop new queries.
 How do you usualy do when you have new queries, do you write a
 specific job to update data in the database to match the new query you
 are writing ?

 Thanks for your help.

 Jean-Yves


Good point. Generally you will need to write some type of range-scanning
or map/reduce application to process and backfill your data.
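
A rough sketch of such a backfill pass, assuming the pycassa client (CF
names are illustrative): range-scan every row of the existing CF and write
the new query's view of the data into a second CF.

  import pycassa

  pool = pycassa.ConnectionPool('Keyspace1', server_list=['localhost:9160'])
  source = pycassa.ColumnFamily(pool, 'entries')
  by_user = pycassa.ColumnFamily(pool, 'entries_by_user')

  for key, columns in source.get_range():      # iterates over all rows
      user = columns.get('user')
      if user is not None:
          # The new query "all entry keys for a user" becomes one wide row.
          by_user.insert(user, {key: ''})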


Re: Quick Poll: Server names

2010-07-27 Thread Edward Capriolo
On Tue, Jul 27, 2010 at 11:49 AM, uncle mantis uncleman...@gmail.com wrote:
 Ah S**T! The Pooh server is is down again! =)

 What does one do if they run out of themed names?

 Regards,

 Michael


 On Tue, Jul 27, 2010 at 10:46 AM, Brett Thomas brettptho...@gmail.com
 wrote:

 I like names of colleges

 On Tue, Jul 27, 2010 at 11:40 AM, Dave Viner davevi...@pobox.com wrote:

 I've seen  used several...
 names of children of employees of the company
 names of streets near office
 names of diseases (lead to very hard to spell names after a while, but
 was quite educational for most developers)
 names of characters from famous books (e.g., lord of the rings, asimov
 novels, etc)


 On Tue, Jul 27, 2010 at 7:54 AM, uncle mantis uncleman...@gmail.com
 wrote:

 I will be naming my servers after insect family names. What do you all
 use for yours?

 If this is something that is too off topic please contact a moderator.

 Regards,

 Michael





I know this is a fun thread, and I hate being a Debbie Downer,
but... in my opinion, naming servers after anything other than their
function is not a great idea. Let's look at some positives and negatives:

System1:
cassandra01
cassandra02
cassandra03

VS

System2:
tom
dick
harry

Forward and reverse DNS:

System1 is easy to manage: with the server number you can easily figure
out an offset.
System2 requires careful mapping and will be more error prone.

The future:
Way back when, a company I was at used Native American tribe names.
Guess what happened. At about 20 nodes we ran out of common names like
Cherokee, and we had servers named Choctaw. These names become hard to
spell and hard to say. Once you run out of Native American names and
you start using country names, what is the point? It is not even a
convention any more. Cassandra servers are named after Native
Americans, or possibly food, or possibly a dog.

Quick, someone... fido just went down! What does fido do? Is it
important? Is it in our web cluster or our Cassandra cluster?

Someone above mentioned Chevron1 through Chevron9. Look, they ran out of
unique names after the fifth server. So essentially five unique fun names,
then chevron6-1000. Why is chevron6-1000 better than cassandra6-1000,
and is it any more fun?

Reboots:
Have you ever called a data center at 1AM for a server reboot? Picking
a fancy, non-phonetic name is a great way for a tired NOC operator to
reboot the wrong one.


Re: how to recover cassandra data

2010-08-02 Thread Edward Capriolo
On Mon, Aug 2, 2010 at 9:11 AM, john xie shanfengg...@gmail.com wrote:
 ReplicationFactor = 3
 one day i stop 192.168.1.147 and remove cassandra data by mistake, can i
 recover  192.168.1.147's cassadra data by restart cassandra ?


     <DataFileDirectories>
          <DataFileDirectory>/data1/cassandra/</DataFileDirectory>
          <DataFileDirectory>/data2/cassandra/</DataFileDirectory>
          <DataFileDirectory>/data3/cassandra/</DataFileDirectory>
      </DataFileDirectories>
 /data3 is mounted on /dev/sdd
 I removed /data3 and formatted /dev/sdd

 Address       Status     Load          Range
      Ring

 135438270110006521520577363629178401179
 192.168.1.148 Up         50.38 GB
  5243502939295338512484974245382898     |--|
 192.168.1.145 Up         48.38 GB
  63161078970569359253391371326773726097     |   |
 192.168.1.147 ?          23.5 GB
 79546317728707787532885001681404757282     |   |
 192.168.1.146 Up         26.34 GB
  135438270110006521520577363629178401179    |--|







Since you have a replication factor of three, if you bring up a new node
through auto-bootstrap, the data will migrate back to it, since the rest of
the cluster still holds two copies. Nothing is lost.


Re: unable to start cassandra

2010-08-03 Thread Edward Capriolo
On Tue, Aug 3, 2010 at 10:47 AM, Maciej Lisowski
m.lisow...@powerprice.pl wrote:
 Hi all,

 I’m new here and new with Cassandra and I’ve got problem to run it (v.
 0.6.4) with jdk1.6.0_21.

 When I type “cassandra” to run it I get error:



 ERROR 16:23:53,803 Uncaught exception in thread
 Thread[ROW-MUTATION-STAGE:5,5,main]

 java.util.concurrent.ExecutionException: java.lang.RuntimeException:
 java.lang.NullPointerException

     at
 java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)

     at java.util.concurrent.FutureTask.get(FutureTask.java:83)

     at
 org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86)

     at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:888)

     at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

     at java.lang.Thread.run(Thread.java:619)

 Caused by: java.lang.RuntimeException: java.lang.NullPointerException

     at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)

     at
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)

     at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

     at java.util.concurrent.FutureTask.run(FutureTask.java:138)

     at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

     ... 2 more

 Caused by: java.lang.NullPointerException

     at
 org.apache.cassandra.db.Table$TableMetadata.getColumnFamilyId(Table.java:131)

     at org.apache.cassandra.db.Table.getColumnFamilyId(Table.java:364)

     at
 org.apache.cassandra.db.commitlog.CommitLog$4.runMayThrow(CommitLog.java:256)

     at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)

     ... 6 more



 I was looking for info what could happen but I didn’t find… can someone help
 me with this?

 If I have to send something more (configuration or whatever) let me know



 Maciek

Something similar happened to me when upgrading from 0.6.1 to 0.6.2. Even
though the on-disk format of the SSTables is the same, sometimes the
wire-format serialization of future tasks changes. If that is the case,
it means that the upgrade can NOT be done with a rolling restart. I am
not sure this is the case here, but that might help.

Edward


Re: unable to start cassandra

2010-08-03 Thread Edward Capriolo
On Tue, Aug 3, 2010 at 11:44 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
 On Tue, Aug 3, 2010 at 10:47 AM, Maciej Lisowski
 m.lisow...@powerprice.pl wrote:
 Hi all,

 I’m new here and new with Cassandra and I’ve got problem to run it (v.
 0.6.4) with jdk1.6.0_21.

 When I type “cassandra” to run it I get error:



 ERROR 16:23:53,803 Uncaught exception in thread
 Thread[ROW-MUTATION-STAGE:5,5,main]

 java.util.concurrent.ExecutionException: java.lang.RuntimeException:
 java.lang.NullPointerException

     at
 java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)

     at java.util.concurrent.FutureTask.get(FutureTask.java:83)

     at
 org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86)

     at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:888)

     at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

     at java.lang.Thread.run(Thread.java:619)

 Caused by: java.lang.RuntimeException: java.lang.NullPointerException

     at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)

     at
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)

     at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

     at java.util.concurrent.FutureTask.run(FutureTask.java:138)

     at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

     ... 2 more

 Caused by: java.lang.NullPointerException

     at
 org.apache.cassandra.db.Table$TableMetadata.getColumnFamilyId(Table.java:131)

     at org.apache.cassandra.db.Table.getColumnFamilyId(Table.java:364)

     at
 org.apache.cassandra.db.commitlog.CommitLog$4.runMayThrow(CommitLog.java:256)

     at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)

     ... 6 more



 I was looking for info what could happen but I didn’t find… can someone help
 me with this?

 If I have to send something more (configuration or whatever) let me know



 Maciek

 Something similar happened to me when upgrading from 6.1 - 6.2. Even
 though the on-disk format of the SSTABLES is the same, sometimes the
 wire-format serialization of Future Tasks change. If that is the case,
 it means that the upgrade can NOT be done with a rolling restart. I am
 not sure this is the case here but that might help.

 Edward


Sorry, misread on my part. It does not look like you are doing an
upgrade, so disregard my comments.


Growing commit log directory.

2010-08-09 Thread Edward Capriolo
I have a 16 node 6.3 cluster and two nodes from my cluster are giving
me major headaches.

10.71.71.56   Up 58.19 GB
10827166220211678382926910108067277|   ^
10.71.71.61   Down   67.77 GB
123739042516704895804863493611552076888v   |
10.71.71.66   Up 43.51 GB
127605887595351923798765477786913079296|   ^
10.71.71.59   Down   90.22 GB
139206422831293007780471430312996086499v   |
10.71.71.65   Up 22.97 GB
148873535527910577765226390751398592512|   ^

The symptoms I am seeing are nodes 61 and nodes 59 have huge 6 GB +
commit log directories. They keep growing, along with memory usage,
eventually the logs start showing GCInspection errors and then the
nodes will go OOM

INFO 14:20:01,296 Creating new commitlog segment
/var/lib/cassandra/commitlog/CommitLog-1281378001296.log
 INFO 14:20:02,199 GC for ParNew: 327 ms, 57545496 reclaimed leaving
7955651792 used; max is 9773776896
 INFO 14:20:03,201 GC for ParNew: 443 ms, 45124504 reclaimed leaving
8137412920 used; max is 9773776896
 INFO 14:20:04,314 GC for ParNew: 438 ms, 54158832 reclaimed leaving
8310139720 used; max is 9773776896
 INFO 14:20:05,547 GC for ParNew: 409 ms, 56888760 reclaimed leaving
8480136592 used; max is 9773776896
 INFO 14:20:06,900 GC for ParNew: 441 ms, 58149704 reclaimed leaving
8648872520 used; max is 9773776896
 INFO 14:20:08,904 GC for ParNew: 462 ms, 59185992 reclaimed leaving
8816581312 used; max is 9773776896
 INFO 14:20:09,973 GC for ParNew: 460 ms, 57403840 reclaimed leaving
8986063136 used; max is 9773776896
 INFO 14:20:11,976 GC for ParNew: 447 ms, 59814376 reclaimed leaving
9153134392 used; max is 9773776896
 INFO 14:20:13,150 GC for ParNew: 441 ms, 61879728 reclaimed leaving
9318140296 used; max is 9773776896
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid10913.hprof ...
 INFO 14:22:30,620 InetAddress /10.71.71.66 is now dead.
 INFO 14:22:30,621 InetAddress /10.71.71.65 is now dead.
 INFO 14:22:30,621 GC for ConcurrentMarkSweep: 44862 ms, 261200
reclaimed leaving 9334753480 used; max is 9773776896
 INFO 14:22:30,621 InetAddress /10.71.71.64 is now dead.

Heap dump file created [12730501093 bytes in 253.445 secs]
ERROR 14:28:08,945 Uncaught exception in thread Thread[Thread-2288,5,main]
java.lang.OutOfMemoryError: Java heap space
at 
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:71)
ERROR 14:28:08,948 Uncaught exception in thread Thread[Thread-2281,5,main]
java.lang.OutOfMemoryError: Java heap space
at 
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:71)
 INFO 14:28:09,017 GC for ConcurrentMarkSweep: 33737 ms, 85880
reclaimed leaving 9335215296 used; max is 9773776896

Does anyone have any ideas what is going on?


Re: Growing commit log directory.

2010-08-09 Thread Edward Capriolo
On Mon, Aug 9, 2010 at 8:20 PM, Jonathan Ellis jbel...@gmail.com wrote:
 what does tpstats or other JMX monitoring of the o.a.c.concurrent stages show?

 On Mon, Aug 9, 2010 at 4:50 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
 I have a 16 node 6.3 cluster and two nodes from my cluster are giving
 me major headaches.

 10.71.71.56   Up         58.19 GB
 10827166220211678382926910108067277    |   ^
 10.71.71.61   Down       67.77 GB
 123739042516704895804863493611552076888    v   |
 10.71.71.66   Up         43.51 GB
 127605887595351923798765477786913079296    |   ^
 10.71.71.59   Down       90.22 GB
 139206422831293007780471430312996086499    v   |
 10.71.71.65   Up         22.97 GB
 148873535527910577765226390751398592512    |   ^

 The symptoms I am seeing are nodes 61 and nodes 59 have huge 6 GB +
 commit log directories. They keep growing, along with memory usage,
 eventually the logs start showing GCInspection errors and then the
 nodes will go OOM

 INFO 14:20:01,296 Creating new commitlog segment
 /var/lib/cassandra/commitlog/CommitLog-1281378001296.log
  INFO 14:20:02,199 GC for ParNew: 327 ms, 57545496 reclaimed leaving
 7955651792 used; max is 9773776896
  INFO 14:20:03,201 GC for ParNew: 443 ms, 45124504 reclaimed leaving
 8137412920 used; max is 9773776896
  INFO 14:20:04,314 GC for ParNew: 438 ms, 54158832 reclaimed leaving
 8310139720 used; max is 9773776896
  INFO 14:20:05,547 GC for ParNew: 409 ms, 56888760 reclaimed leaving
 8480136592 used; max is 9773776896
  INFO 14:20:06,900 GC for ParNew: 441 ms, 58149704 reclaimed leaving
 8648872520 used; max is 9773776896
  INFO 14:20:08,904 GC for ParNew: 462 ms, 59185992 reclaimed leaving
 8816581312 used; max is 9773776896
  INFO 14:20:09,973 GC for ParNew: 460 ms, 57403840 reclaimed leaving
 8986063136 used; max is 9773776896
  INFO 14:20:11,976 GC for ParNew: 447 ms, 59814376 reclaimed leaving
 9153134392 used; max is 9773776896
  INFO 14:20:13,150 GC for ParNew: 441 ms, 61879728 reclaimed leaving
 9318140296 used; max is 9773776896
 java.lang.OutOfMemoryError: Java heap space
 Dumping heap to java_pid10913.hprof ...
  INFO 14:22:30,620 InetAddress /10.71.71.66 is now dead.
  INFO 14:22:30,621 InetAddress /10.71.71.65 is now dead.
  INFO 14:22:30,621 GC for ConcurrentMarkSweep: 44862 ms, 261200
 reclaimed leaving 9334753480 used; max is 9773776896
  INFO 14:22:30,621 InetAddress /10.71.71.64 is now dead.

 Heap dump file created [12730501093 bytes in 253.445 secs]
 ERROR 14:28:08,945 Uncaught exception in thread Thread[Thread-2288,5,main]
 java.lang.OutOfMemoryError: Java heap space
        at 
 org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:71)
 ERROR 14:28:08,948 Uncaught exception in thread Thread[Thread-2281,5,main]
 java.lang.OutOfMemoryError: Java heap space
        at 
 org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:71)
  INFO 14:28:09,017 GC for ConcurrentMarkSweep: 33737 ms, 85880
 reclaimed leaving 9335215296 used; max is 9773776896

 Does anyone have any ideas what is going on?




 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com


Hey guys, thanks for the help. I had lowered my Xmx from 12GB to 10GB
because I saw:

[r...@cdbsd09 ~]# /usr/local/cassandra/bin/nodetool --host 10.71.71.59
--port 8585 info
123739042516704895804863493611552076888
Load : 68.91 GB
Generation No: 1281407425
Uptime (seconds) : 1459
Heap Memory (MB) : 6476.70 / 12261.00

This was happening:
[r...@cdbsd11 ~]# /usr/local/cassandra/bin/nodetool --host
cdbsd09.hadoop.pvt --port 8585 tpstats
Pool Name                 Active   Pending  Completed
STREAM-STAGE  0 0  0
RESPONSE-STAGE0 0  16478
ROW-READ-STAGE   64  4014  18190
LB-OPERATIONS 0 0  0
MESSAGE-DESERIALIZER-POOL 0 0  60290
GMFD  0 0385
LB-TARGET 0 0  0
CONSISTENCY-MANAGER   0 0   7526
ROW-MUTATION-STAGE   64   908 182612
MESSAGE-STREAMING-POOL0 0  0
LOAD-BALANCER-STAGE   0 0  0
FLUSH-SORTER-POOL 0 0  0
MEMTABLE-POST-FLUSHER 0 0  8
FLUSH-WRITER-POOL 0 0  8
AE-SERVICE-STAGE  0 0  0
HINTED-HANDOFF-POOL   1 9  6

After raising the Xmx again I realized I was maxing out the heap. The
other nodes are running fine with an Xmx of 9GB, but I guess these nodes
cannot.

Thanks again.
Edward


a plea not to remove rowsize warning

2010-08-11 Thread Edward Capriolo
Hello all,

I recently posted on-list about a situation where two of the nodes in my
16-node cluster were garbage collecting and OOMing. I was able to move
my Xmx from 9GB to 11GB and see that, rather than the usual memory
sawtooth, I would sawtooth around 4GB before memory shot up like a rocket.

After digging around I noticed the JMX row stats on that node said the
max compacted row size was 128MB, while the mean row size was 2000 bytes.

At the time I was unaware of the setting that warns of large rows
during compaction. Unfortunately that setting is too high by default
(512MB), at least for someone using the row cache like I am.

When a key gets this large, extreme memory pressure is put on the
system to get it in and out of the row cache.

I was able to lower the setting to 10MB and got nice printed
warnings showing me the offending keys. I do not know how this got
there. My guess is that null is getting encoded into this key and the key
becomes the graveyard for bad data.

Until the row cache can handle large keys better, I find it
imperative to keep the setting and the warnings, as writing a program
to range scan all the data to find one big key is very intensive.

Re: indexing rows ordered by int

2010-08-15 Thread Edward Capriolo
On Sunday, August 15, 2010, S Ahmed sahmed1...@gmail.com wrote:
 For CF that I need to perform range scans on, I create separate CF that have 
 custom ordering.
 Say a CF holds comments on a story (like comments on a reddit or digg story 
 post)
 So if I need to order comments by votes, it seems I have to re-index every 
 time someone votes on a comment (or batch it every x minutes).



 Right now I think I have to pull all the comments into memory, then sort by 
 votes, then re-write the index.
 Are there any best-practises for this type of index?

It seems that most stories will have few comments (1-100). If you are
only looking to order the comments on a given article by vote, this seems
like something you would want to store with the article and/or
calculate on the fly.

Unless you were looking for a feature like "show the highest rated comment
across all articles", I do not understand why you would need a separate
CF.
Does my suggestion make sense? If not, can you share your storage.xml?
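
A small sketch of the "calculate on the fly" suggestion, assuming the
pycassa client and a hypothetical 'comments' CF keyed by story id with one
JSON blob per comment.

  import json
  import pycassa

  pool = pycassa.ConnectionPool('Keyspace1', server_list=['localhost:9160'])
  comments = pycassa.ColumnFamily(pool, 'comments')

  def comments_by_votes(story_id):
      # A story rarely has more than a few hundred comments, so just pull
      # them all and sort in the application.
      cols = comments.get(story_id, column_count=1000)
      parsed = [json.loads(blob) for blob in cols.values()]
      return sorted(parsed, key=lambda c: c['votes'], reverse=True)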


Hive Storage Handler for Cassandra

2010-08-16 Thread Edward Capriolo
Hello,

Anyone interested in doing map/reduce on Cassandra data should take a
look at Cassandra Storage Handler for Hive. Storage handlers give Hive
the ability to work with data outside HDFS in a more natural way.
Support is now in place for reading and writing to/from Standard
Column Families (no super column support yet). While this allows users
to use an SQL like language on their Cassandra data, it does NOT do
things like push down of a where clause into sub-second queries.

https://issues.apache.org/jira/browse/HIVE-1434

For those looking to try this out with minimal effort, I have a tar
bundle with cassandra, hive, and hadoop here:

http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/test_drive_hive_cassandra_integration

::Warning::
The bundle is a pre-release build of Hive with cassandra support.
Treat it as such.

Enjoy,
Edward


Re: cache sizes using percentages

2010-08-17 Thread Edward Capriolo
On Tue, Aug 17, 2010 at 1:55 PM, Artie Copeland yeslinux@gmail.com wrote:
 if i set a key cache size of 100% the way i understand how that works is:
 - the cache is not write through, but read through
 - a key gets added to the cache on the first read if not already available
 - the size of the cache will always increase for ever item read.  so if you
 have 100mil items your key cache will grow to 100mil
 Here are my questions:
 if that is the case then what happens if you only have enough mem to store
 10mil items in your key cache?
 do you lose the other 90% how is it determined what is removed?
 will the server keep adding til it gets OOM?
 if you add a row cache as well how does that affect your percentage?
 if there a priority between the cache? or are they independant so both will
 try to be satisfied which would result in an OOM?
 thanx,
 artie
 --
 http://yeslinux.org
 http://yestech.org


Artie,

In my experience, what ends up happening is this: you start your server and
all is well, your cache builds up, and the cache hit rate keeps climbing! Of
course, so does memory usage. At some point you start reaching your
Xmx and Java keeps trying to garbage collect, more and more often. A couple
of things can happen, all of them bad. One is just hitting an OOM. Another
is that the JVM spends so much time on garbage collection and so little
time processing that it throws another exception (which might be a subtype
of OOM).

 do you lose the other 90% how is it determined what is removed?
Items are removed when the configured capacity is reached; actual memory
usage is NOT taken into account.

if you add a row cache as well how does that affect your percentage?
They are mutually exclusive.

 if there a priority between the cache?
No

