Thanks for your quick response.

> Using the timestamp dimension of columns, "reusing" columns, could prove
> quite useful. This allows simple use of batch_mutate deletes (new in
> 0.6) to purge old data outside the active time window.

   Interesting, while drafting the app's model in Cassandra I wasn't
aware of the timestamping.
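
   To make sure I follow, here is a minimal sketch (plain Python; the
window and bucket sizes are hypothetical, just for illustration) of how
we might name columns by time bucket, so that buckets sliding out of the
active window can later be purged in bulk with a batch_mutate delete:

```python
WINDOW_SECONDS = 7 * 24 * 3600  # hypothetical one-week active window
BUCKET_SECONDS = 3600           # hypothetical one-hour column buckets

def bucket_for(ts):
    """Map a UNIX timestamp to the start of its time bucket.

    Using the bucket start as (part of) the column name groups data
    so an entire expired bucket can be deleted in one mutation.
    """
    return int(ts) - int(ts) % BUCKET_SECONDS

def is_expired(bucket_start, now):
    """True if this bucket has slid entirely out of the active window."""
    return bucket_start < bucket_for(now - WINDOW_SECONDS)
```

   Is this roughly the kind of "reusing columns" layout you had in mind?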

> Otherwise, performance wise, deletes and "updates" are the same in
> Cassandra (see
> http://spyced.blogspot.com/2010/02/distributed-deletes-in-cassandra.html).

   Yes, I saw @spyced's blog post; that's why I asked for the
community's opinion.

> Data should be spread out over the ring, so load distribution is
> constant regardless of time or "burst peaks".

   Does this mean we should stick to the RandomPartitioner (RP) when
developing our model?

> A separate location cache, using a counting/timestamped bloom filter
> might be useful too, depending on your app, data structures, and
> throughput requirements. This should be kept outside cassandra and in
> RAM (redis or even memcache would fit nicely, but a simple RPC service
> would be faster). Something like such would allow you to build a tuned
> sliding-window type cache to ensure reads are minimized.

   Are you suggesting a Bloom filter based approach for cheap cache lookups?

   We are big fans of Redis; we already use it with Resque (jobs). We
also considered modeling our whole problem using Redis alone, but that
felt like a rather radical approach.
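
   To check my understanding of the suggestion, here is a minimal
counting Bloom filter sketch (plain Python; the size and hash count are
hypothetical, not tuned). Counters instead of bits allow removal, which
is what a sliding-window cache needs; a negative answer means the key is
definitely absent, so the Cassandra read can be skipped:

```python
import hashlib

class CountingBloomFilter:
    """Minimal counting Bloom filter sketch (illustrative, not production)."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.counters = [0] * size

    def _positions(self, key):
        # Derive `hashes` independent positions from one hash function.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.counters[p] += 1

    def remove(self, key):
        # Possible because we keep counters, not single bits.
        for p in self._positions(key):
            if self.counters[p] > 0:
                self.counters[p] -= 1

    def might_contain(self, key):
        # False => definitely absent: skip the Cassandra read.
        return all(self.counters[p] > 0 for p in self._positions(key))
```

   Is this the general shape of the "counting/timestamped bloom filter"
you meant, kept in RAM in front of Cassandra?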

> Rinse, refactor, repeat, until fast enough and/or job is done ...


>>     - Can we keep this "data window" approach, or will a high rate of
>> delete pose a problem?
> Delete and "insert" are both mutations, so if you can do one, you can do
> the other in ~ the same time. IOW, your rate of mutations in a
> one-in-one-out scenario is simply 2 * insert-rate.
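
   If I understand correctly, the capacity math for our one-in-one-out
window is simply (hypothetical numbers):

```python
insert_rate = 5000                  # hypothetical inserts per second
delete_rate = insert_rate           # one delete per insert once the window is full
mutation_rate = insert_rate + delete_rate  # i.e. 2 * insert_rate
```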


> Due to the nature of deletes, you need to plan for storing "deleted"
> data until compaction though. The compaction phase itself will probably
> need accounting for, but that too is predictable.

    Does "compaction" affect performance? Does it take the node
offline? I have read about it, and about keeping the files small, but I
am still not sure about its operational impact.

> 0.6 has improved caching for reads, but if your app truly needs high
> performance reads, some kind of application-tuned cache frontend (as
> mentioned above) is not a bad thing. For sliding-window time series,
> it's hard to beat a simple bloom-filter based cache without reaching for
> complexity.

   According to the mailing list, the improvements in 0.6 make it a
"must" upgrade.

   Then, can we think of RAM as a helper more than a constraint?

   My point: once the data/indexes in an RDBMS grow bigger than RAM,
performance drops, and then you have to juggle the data and/or tune it.
Is this the case with Cassandra? This is an architectural point I am
not clear on.

> It's certainly possible to decouple purging from insertion in Cassandra,
> but there's no generic "this is how you do it" answer.
> This, IMHO, is a good thing though.

   Thanks a lot for your help,

Aníbal Rojas
Ruby on Rails Web Developer
