On Mon, Aug 30, 2010 at 10:10 PM, Jonathan Ellis <jbel...@gmail.com> wrote:

> column names are stored per cell
>
> (moving to user@)
>


I think that is already accounted for in my numbers?

What I listed was measured from the actual SSTable file (using the output
from "strings <sstable.db>"), so the repeated supercolumn and column
names are already part of the strings output.

Typically, you get something like this as output from strings:
20100629
20100629
20100629
<string matching the "type">
java.util.BitSetn
bitst
[Jxpur
[Jx

repeating.

I am not entirely sure why I get those repeating supercolumn names there.
There are more supercolumn names in this file than column names, which is
not logical; it should be the other way around. I will have a closer
look at that one.

These strings make up roughly 40% of the total data. The remainder is
binary data and tons of null bytes.

The strings command (which will of course also give me some binary noise)
returns 14,943,928 bytes (or rather characters) of data.
If we ignore the binary noise for a second and also count the number of null
bytes in this file, we get:

Text: 14,943,928 bytes (as mentioned in my previous posting, 9.4MB of this
is column headers)
Null Bytes: 14,634,412 bytes
Other (binary): 8,580,188 bytes
Total size: 38,158,528
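
For anyone who wants to reproduce this kind of breakdown on their own
SSTable, here is a minimal sketch in Python. It is not the exact procedure
used for the numbers above: it assumes that "text" means runs of at least 4
printable ASCII characters (roughly what strings does by default), and it
does not count the newline that strings prints after each match, so the
totals will differ slightly from "strings | wc -c". The function name
byte_breakdown is made up for this example.

```python
import re

# Runs of 4 or more printable ASCII characters (space through tilde, plus
# tab). This is an assumption that approximates the default of strings(1).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e\t]{4,}")


def byte_breakdown(data: bytes) -> dict:
    """Split a byte string into text, null and other byte counts."""
    # Bytes that fall inside a printable run of length >= 4.
    text = sum(len(m.group()) for m in PRINTABLE_RUN.finditer(data))
    # Null bytes are never printable, so they cannot overlap with text.
    nulls = data.count(b"\x00")
    # Everything else: binary data and printable runs shorter than 4.
    other = len(data) - text - nulls
    return {"text": text, "null": nulls, "other": other, "total": len(data)}


def file_breakdown(path: str) -> dict:
    """Read a whole file into memory and classify its bytes."""
    with open(path, "rb") as f:
        return byte_breakdown(f.read())
```

Calling file_breakdown() on the path to a data file returns the four
counts. For the figures listed above, the shares work out to roughly 39%
text, 38% null bytes and 22% other binary data.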

Yes, yes, doing this is ugly, and lots of null bytes would occur for many
reasons (no need to tell me that). But chew on that number for a second
and take a look at an SSTable near you: there is a heck of a lot of nothing
there.

It should be noted that this is 0.7 beta 1.

I realize that this code will change dramatically by 0.8, so this is probably
not worth spending too much time on, but the expansion of data is
pretty excessive in many scenarios, so I just looked briefly at an actual
file to try to understand it a bit better.

Terje
