On Mon, Feb 22, 2010 at 1:40 PM, Sonny Heer <sonnyh...@gmail.com> wrote:
> Hey,
>
> We are in the process of implementing a cassandra application service.
>
> we have already ingested TB of data using the cassandra bulk loader
> (StorageService).
>
> One of the requirements is to get a data explosion factor as a result of
> denormalization. Since the writes are going to the memory tables, I'm not
> sure how I could grab stats. I cant get size of data before ingest since
> some of the data may be duplicated.
>

Are you talking about duplication across nodes due to the replication factor, or because some rows may still be in the memtable?

I think what you want to do is bin/nodeprobe flush, bin/nodeprobe compact, wait until the system is idle and then sum the size of everything in your data paths that starts with the name of your column family.

> Also a general problem we are running into is an easy way to do paging over
> the data set (not just rows but columns). Looks like now the API has ways
> to do count, but no offset.
>

Columns can easily be paginated via the 'start' and 'finish' parameters. You can't jump to a random page, but you can provide next/previous behavior.

-Brandon
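
For the size measurement described above (after the flush and compact have finished and the node is idle), a rough Python sketch of the "sum everything that starts with the column family name" step might look like the following. The function name and the example paths are hypothetical, and the plain prefix match is an assumption taken directly from the description; it would also count files of another column family whose name begins with the same string.

```python
import os


def column_family_bytes(data_dirs, column_family):
    """Sum the size in bytes of every file under data_dirs whose
    file name starts with column_family.

    Assumption: a plain prefix match, as described in the reply.
    A column family named "Foo" would also match files for "FooBar".
    """
    total = 0
    for data_dir in data_dirs:
        for root, _dirs, files in os.walk(data_dir):
            for name in files:
                if name.startswith(column_family):
                    total += os.path.getsize(os.path.join(root, name))
    return total


if __name__ == "__main__":
    # Hypothetical data path and column family name; substitute your own.
    print(column_family_bytes(["/var/lib/cassandra/data"], "Standard1"))
```

Run this on each node and add the results together. The total includes every replica, so it reflects the replication factor as well as the denormalization.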
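
The next/previous paging over columns mentioned above can be sketched roughly as follows. This is not the Cassandra client API: `make_fake_slice` is a hypothetical in-memory stand-in for a slice query over one row, assumed to return columns in sorted name order beginning at an inclusive `start`, with an empty `start` meaning the beginning of the row. The idea is to request one column more than the page size and use that extra column's name as the `start` of the next page.

```python
from bisect import bisect_left


def make_fake_slice(columns):
    """Hypothetical stand-in for a column slice query on a single row.

    Returns a function get_slice(start, count) that yields up to `count`
    (name, value) pairs in sorted name order, starting at `start` inclusive.
    """
    names = sorted(columns)

    def get_slice(start, count):
        i = bisect_left(names, start)  # "" sorts first, so it means "from the beginning"
        return [(n, columns[n]) for n in names[i:i + count]]

    return get_slice


def iter_pages(get_slice, page_size):
    """Yield pages of columns, using the first column of the next page
    as the next `start` value instead of a numeric offset."""
    start = ""
    while True:
        # Fetch one extra column to learn where the next page begins.
        batch = get_slice(start, page_size + 1)
        page, extra = batch[:page_size], batch[page_size:]
        if page:
            yield page
        if not extra:
            return
        start = extra[0][0]


if __name__ == "__main__":
    cols = {"c%02d" % i: i for i in range(5)}
    for page in iter_pages(make_fake_slice(cols), 2):
        print([name for name, _value in page])
```

For "previous" behavior, one option is to remember the `start` value used for each page already shown and reissue the query with the earlier one.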