On Mon, Aug 30, 2010 at 10:10 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> column names are stored per cell
>
> (moving to user@)

I think that is already accounted for in my numbers?

What I listed was measured from the actual SSTable file (using the output from "strings <sstable.db>"), so the repeated supercolumn and column names are already part of the strings output.

Typically, you get something like this as output from strings:

    20100629 20100629 20100629 <string matching the "type"> java.util.BitSetn bitst [Jxpur [Jx

and so on, repeating.

I am not entirely sure why I get those repeating supercolumn names there (there are more supercolumn names in this file than column names, which is not logical; it should be the other way around!), but I will have a closer look at that one.

These strings make up about 1/2 of the total data, the remainder being binary and tons of null bytes.

The strings command (which will of course give me some binary noise) returns 14,943,928 bytes (or rather characters) of data.

If we ignore the binary noise for a second and also count the number of null bytes in this file, we get:

    Text:            14,943,928 bytes (as mentioned in my previous posting, 9.4MB of this is column headers)
    Null bytes:      14,634,412 bytes
    Other (binary):   8,580,188 bytes
    Total size:      38,158,528 bytes

Yes yes yes, doing this is ugly, and lots of null bytes would occur for many reasons (no reason to tell me that), but chew on that number for a second and take a look at an SSTable near you: there is a heck of a lot of nothing there.

It should be noted that this is 0.7 beta 1. I realize that this code will change dramatically by 0.8, so this is probably not worth spending too much time on, but the expansion of data is pretty excessive in many scenarios, so I just looked briefly at an actual file to try to understand it a bit better.

Terje
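
For anyone wanting to repeat this kind of measurement on their own SSTable, here is a minimal Python sketch of a text / null / other byte breakdown. It is not how the numbers above were produced (those came from the strings command plus a null-byte count), so the figures will not match exactly: the sketch only approximates strings by treating runs of four or more printable ASCII bytes (or tabs) as text. The names byte_breakdown and file_breakdown are made up for illustration.

```python
import re
from collections import namedtuple

# Hypothetical result type: byte counts per category.
Breakdown = namedtuple("Breakdown", "text null other total")

# Assumption: "text" means runs of at least 4 printable ASCII bytes or tabs,
# which is roughly what strings(1) reports with its default settings.
_PRINTABLE_RUN = re.compile(rb"[\x20-\x7e\t]{4,}")


def byte_breakdown(data: bytes) -> Breakdown:
    """Split a byte string into text, null and other (binary) byte counts."""
    text = sum(len(m.group()) for m in _PRINTABLE_RUN.finditer(data))
    null = data.count(b"\x00")
    # Everything that is neither in a printable run nor a null byte.
    other = len(data) - text - null
    return Breakdown(text, null, other, len(data))


def file_breakdown(path: str) -> Breakdown:
    """Read a whole file into memory and classify its bytes."""
    with open(path, "rb") as f:
        return byte_breakdown(f.read())
```

Calling file_breakdown on the path of a data file returns the four counts. Unlike the strings output, the text count here does not include the newline that strings prints after each match, so it will come out somewhat lower than a character count of the strings output.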