Sounds like you have a similar configuration to us.

We have 6 EC2 small instances, with EBS for storage.

Nothing scientific for benchmarks right now, but typically we can retrieve
60,000 columns scattered across 3600 row keys in about 7-10 seconds.

Writes haven't been a bottleneck at all.

I also have a key distribution issue similar to what you describe. So I will
be attempting the same recipe as you shortly.
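
For step 1 of your recipe, here is a rough sketch of how I'd work out the
targets. It just applies your formula, i * (2**127 / N) for i=1..N, with
integer arithmetic. The function name and the ring_size parameter are my own
invention for illustration, not anything that ships with Cassandra, and I'm
assuming the 2**127 range you quote for random partitioning.

```python
def ideal_tokens(node_count, ring_size=2**127):
    """Return evenly spaced tokens for node_count nodes.

    Hypothetical helper: follows the formula i * (ring_size / N) for
    i = 1..N from the quoted mail. ring_size defaults to 2**127, the
    range assumed there for random partitioning.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Multiply before dividing so integer division loses as little
    # precision as possible when ring_size is not a multiple of N.
    return [i * ring_size // node_count for i in range(1, node_count + 1)]


if __name__ == "__main__":
    # Example: the six-node cluster mentioned above.
    for position, token in enumerate(ideal_tokens(6), start=1):
        print(position, token)
```

Each printed value would be the target for one 'nodeprobe move'. Note that
with i=1..N the last token is 2**127 itself, so check whether your version
expects the range to start at 0 instead (i=0..N-1).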

I'm very interested to hear about your experiences running Cassandra on EC2.

Ryan

On Wed, Jan 13, 2010 at 2:26 PM, Anthony Molinaro <
antho...@alumni.caltech.edu> wrote:

> Hi,
>
>  So after several days of closer examination, I've discovered
> something.  EC2 io performance is pretty bad.  Well okay, we already
> all knew that, and I have no choice but to deal with it, as moving
> at this time is not an option.  But what I've really discovered is
> my data is unevenly distributed which I believe is a result of using
> random partitioning without specifying tokens.  So what I think I can
> do to solve this is upgrade to 0.5.0rc3, add more instances, and use
> the tools to modify token ranges.   Towards that end I had a few
> questions about different topics.
>
> Data gathering:
>
>  When I run cfstats I get something like this
>
>  Keyspace: XXXXXXXX
>    Read Count: 39287
>    Read Latency: 14.588 ms.
>    Write Count: 13930
>    Write Latency: 0.062 ms.
>
>  on a heavily loaded node and
>
>  Keyspace: XXXXXXXX
>    Read Count: 8672
>    Read Latency: 1.072 ms.
>    Write Count: 2126
>    Write Latency: 0.000 ms.
>
>  on a lightly loaded node, but my question is what is the timeframe
>  of the counts?  Does a read count of 8K say that 8K reads are currently
>  in progress, or 8K since the last time I checked, or 8K for some interval?
>
> Data Striping:
>
>  One option I have is to add additional EBS volumes, then either turn
>  on raid0 across several EBS volumes or possibly just add additional
>  <DataFileDirectory> elements to my config?  If I were to add
>  <DataFileDirectory> entries, can I just move sstables between
>  directories?  If so I assume I want the Index, Filter and Data files
>  to be in the same directory?  Or is this data movement something
>  Cassandra will do for me?  Also, is this likely to help?
>
> Upgrades:
>
>  I understand that to upgrade from 0.4.x to 0.5.x I need to do something
>  like
>
>    1. turn off all writes to a node
>    2. call 'nodeprobe flush' on that node
>    3. restart node with version 0.5.x
>
>  Is this correct?
>
> Data Repartitioning:
>
>  So it seems that if I first upgrade my current nodes to 0.5.0, then
>  bring up some new nodes with AutoBootstrap on, they should take some
>  data from the most loaded machines?  But let's say I just want to first
>  even out the load on existing nodes, would the process be something like
>
>    1. calculate ideal key ranges (ie, i * (2**127 /N) for i=1..N)
>         (this seems like the ideal candidate for a new tool included
>          with cassandra).
>    2. foreach node
>         'nodeprobe move' to ideal range
>    3. foreach node
>         'nodeprobe clean'
>
>  Alternatively, it looks like I might be able to use
>  'nodeprobe loadbalance' for step 2, and not use step 1?
>
> Also, anyone else running in EC2 and have any sort of tuning tips?
>
> Thanks,
>
> -Anthony
>
> --
> ------------------------------------------------------------------------
> Anthony Molinaro                           <antho...@alumni.caltech.edu>
>
