Hi Anthony,

I think I can explain at least a big chunk of the difference in RAM and disk consumption you are seeing.

Let's start with RAM. I could of course be wrong here, but I believe the 'static bitcask per-key overhead' is simply too small. Let me explain why.
The bitcask_keydir_entry struct for each entry looks like this:

typedef struct
{
    uint32_t file_id;   /* id of the data file holding the value */
    uint32_t total_sz;  /* total size of the on-disk entry */
    uint64_t offset;    /* offset of the entry within the data file */
    uint32_t tstamp;    /* timestamp of the write */
    uint16_t key_sz;    /* length of the key that follows */
    char     key[0];    /* the key itself, stored inline */
} bitcask_keydir_entry;


Summing the fields gives indeed a size of 4 + 4 + 8 + 4 + 2 = 22 bytes (the array 'key' has zero entries because the key is written to the memory directly after the keydir entry). As is done in the capacity planner, you need to add the size of the bucket and key to get the size of a keydir entry, but that is not the whole story.

The thing that is actually stored in 'key' is the result of this Erlang expression:

   erlang:term_to_binary( {<<"bucket">>,<<"key">>} )

that is, a tuple of two binaries converted to the Erlang external term format.

So let's see:

1>  term_to_binary({<<>>,<<>>}).
<<131,104,2,109,0,0,0,0,109,0,0,0,0>>
2>  iolist_size(term_to_binary({<<>>,<<>>})).
13
3>  iolist_size(term_to_binary({<<"a">>,<<"b">>})).
15
4>  iolist_size(term_to_binary({<<"aa">>,<<"b">>})).
16
5>  iolist_size(term_to_binary({<<"aa">>,<<"bb">>})).
17

So even an empty bucket/key pair takes 13 bytes to store: 1 byte for the external term format version, 2 bytes for the tuple header, and 5 bytes (a tag byte plus a 32-bit length) for each of the two binaries. Every byte of bucket or key then adds exactly one byte, as the shell output shows.
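
If it helps, here is a one-line helper of my own (not part of bitcask) that captures this pattern:

%% Serialized size of {Bucket, Key}: 13 fixed bytes (version byte,
%% tuple header, two binary headers) plus the bucket and key bytes.
KeydirKeySize = fun(Bucket, Key) -> 13 + byte_size(Bucket) + byte_size(Key) end.
KeydirKeySize(<<"aa">>, <<"bb">>).  %% => 17, matching the shell output above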

Also, since the hash table storing the keydir entries is essentially an array of pointers to bitcask_keydir_entry objects, there are another 8 bytes of overhead per key, assuming you are running a 64-bit system.

So the real static overhead per key is not 22 but 22 + 13 + 8 = 43 bytes.

Let's run the numbers for your predicted memory consumption again:

  ( 43 + 10 + 36 ) * 183915891 * 3 = 49105542897 = 45.7 GB
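
The same formula as a throwaway shell function of mine, in case you want to experiment with other bucket/key sizes:

%% Predicted keydir RAM: 43 bytes of static overhead per key, plus
%% bucket and key, times entries and replicas.
RamBytes = fun(BucketSz, KeySz, NumEntries, NVal) ->
               (43 + BucketSz + KeySz) * NumEntries * NVal
           end.
RamBytes(10, 36, 183915891, 3).  %% => 49105542897, about 45.7 GB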


Your actual RAM consumption of 70 GB seems to be at odds with the output of erlang:memory/0 that you sent:

{total,7281790968} =>   RAM: 7281790968 * 8 = 54.3 GB


So that is much closer, within about 20 percent. Some additional overhead is to be expected, but it is hard to say how much of that is due to Erlang's internal usage and how much to bitcask.

So let's examine the disk consumption next.
As you rightly concluded, the equation at http://wiki.basho.com/Cluster-Capacity-Planning.html is somewhat simplified, and you are also right that the real equation would be

( 14 + Key + Value ) * Num Entries * N_Val

On the other hand, 14 bytes + key size might be fairly irrelevant if your values have a size of at least 2 KB (as in the example); that seems to be the general assumption in some aspects of the design of riak and bitcask. As you also noticed, this additional small overhead brings you nowhere near the disk usage that you observe.

First, the key that is stored in the bitcask files is not the key part of the bucket/key pair that riak calls a key, but the serialized bucket/key pair described above, so the calculation becomes:

( 14 + ( 13 + Bucket + Key) + Value ) * Num Entries * N_Val

( 14 + ( 13 + 10 + 36) + 36 ) * 183915891 * 3 = 56 GB

Still not enough :-/.
So next let's examine what is actually stored as the value in bitcask. It is not simply the data you provide, but a riak object (an r_object record), which is again serialized by the erlang:term_to_binary/1 function. So let's see. I create a new riak object with a zero-byte bucket, key and value:

3> Obj = riak_object:new(<<>>,<<>>,<<>>).
{r_object,<<>>,<<>>,
          [{r_content,{dict,0,16,16,8,80,48,
                            {[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
                            {{[],[],[],[],[],[],[],[],[],[],[],[],...}}},
                      <<>>}],
          [],
          {dict,1,16,16,8,80,48,
                {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
                {{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
          undefined}
4> iolist_size(erlang:term_to_binary(Obj)).
205

Also, bucket and key are contained in the riak object itself (and therefore in the bitcask notion of the value). So with this information the predicted disk usage becomes:

( 14 + ( 13 + Bucket + Key ) + ( 205 + Bucket + Key + Value ) ) * Num Entries * N_Val

( 14 + ( 13 + 10 + 36) + ( 205 + 10 + 36 ) ) * 183915891 * 3 = 166.5 GB

which is way closer to the 341 GB you observe.
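
If you want to double-check that number, the whole calculation fits into one shell expression:

%% bitcask record header + serialized bucket/key + serialized riak
%% object (205 bytes of overhead plus bucket, key and value bytes):
(14 + (13 + 10 + 36) + (205 + 10 + 36)) * 183915891 * 3.
%% => 178766246052, about 166.5 GB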

But we can get even closer, although the details become somewhat more fuzzy. Bear with me. I again create a riak object, but this time with a non-empty bucket/key so I can store it in riak:

([email protected])7>  Obj = riak_object:new(<<"a">>,<<"a">>,<<>>).
{r_object,<<"a">>,<<"a">>,
          [{r_content,{dict,0,16,16,8,80,48,
                            {[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
                            {{[],[],[],[],[],[],[],[],[],[],[],[],...}}},
                      <<>>}],
          [],
          {dict,1,16,16,8,80,48,
                {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
                {{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
          undefined}

([email protected])8> iolist_size(erlang:term_to_binary(Obj)).
207

([email protected])9>  {ok,C}=riak:local_client().
{ok,{riak_client,'[email protected]',<<2,123,179,255>>}}
([email protected])10> C:put(Obj,1,1).
ok

([email protected])12>  {ok,ObjStored} = C:get(<<"a">>,<<"a">>, 1).
{ok,{r_object,<<"a">>,<<"a">>,
         [{r_content,{dict,2,16,16,8,80,48,
                         {[],[],[],[],[],[],[],[],[],[],[],[],...},
                         {{[],[],[],[],[],[],[],[],[],[],...}}},
                     <<>>}],
              [{<<2,123,179,255>>,{1,63473554112}}],
              {dict,1,16,16,8,80,48,
                    {[],[],[],[],[],[],[],[],[],[],[],[],[],...},
                    {{[],[],[],[],[],[],[],[],[],[],[],...}}},
               undefined}}
([email protected])13> iolist_size(erlang:term_to_binary(ObjStored)).
358



OK, what happened? The object we retrieved is considerably larger than the one we stored. One culprit is the vector clock data, which was an empty list for Obj and now has one entry:

([email protected])14>  riak_object:vclock(Obj).
[]
([email protected])15>  riak_object:vclock(ObjStored).
[{<<2,123,179,255>>,{1,63473554112}}]
([email protected])23> iolist_size(term_to_binary(riak_object:vclock(Obj))).
2
([email protected])24> iolist_size(term_to_binary(riak_object:vclock(ObjStored))).
30

So that's 28 bytes each time the object is updated with a new client ID (so always use a meaningful client ID!) until the vclock pruning sets in. The default bucket property is {big_vclock,50}, so in the worst case this could account for 28 * 50 = 1400 bytes! And every object that has been stored has at least one entry in its vclock, so that is another 28 bytes of overhead.
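
Using the two objects from the session above, the per-entry cost is just the difference of the two sizes we measured:

%% 30 - 2 = 28 bytes for one vclock entry with a 4-byte client ID:
iolist_size(term_to_binary(riak_object:vclock(ObjStored)))
  - iolist_size(term_to_binary(riak_object:vclock(Obj))).
%% => 28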

The other part of the growth stems from some standard entries, which are added to the object metadata during the put operation:

([email protected])35>  dict:to_list(riak_object:get_metadata(Obj)).
[]
([email protected])37> iolist_size(term_to_binary(riak_object:get_metadata(Obj))).
60

([email protected])36>  dict:to_list(riak_object:get_metadata(ObjStored)).
[{<<"X-Riak-VTag">>,"7PoD9FEMUBzNmQeMnjUbas"},
 {<<"X-Riak-Last-Modified">>,{1306,334912,424099}}]
([email protected])38> iolist_size(term_to_binary(riak_object:get_metadata(ObjStored))).
183

So there are the other 123 bytes.

In total, this 356-byte* overhead per object leads us to the following calculation (* 2 bytes of the 358 above came from the bucket and key, which are already accounted for):

( 14 + ( 13 + Bucket + Key ) + ( 356 + Bucket + Key + Value ) ) * Num Entries * N_Val

( 14 + ( 13 + 10 + 36) + ( 356 + 10 + 36 ) ) * 183915891 * 3 = 244 GB


We are getting closer!
If you loaded the data via the REST API, the overhead is somewhat larger still, since the object will also contain 'content-type', 'X-Riak-Meta' and 'Links' metadata entries:

xxxx@node2:~$ curl -v -d '' -H "Content-Type: text/plain" http://127.0.0.1:8098/riak/a/a


([email protected])44>  {ok,ObjStored} = C:get(<<"a">>,<<"a">>, 1).
{ok,{r_object,<<"a">>,<<"a">>,
              [{r_content,{dict,5,16,16,8,80,48,
                                {[],[],[],[],[],[],[],[],[],[],[],[],...},
                                {{[],[],[[<<"Links">>]],[],[],[],[],[],[],[],...}}},
                          <<>>}],
              [{<<5,134,53,93>>,{1,63473557230}}],
              {dict,1,16,16,8,80,48,
                    {[],[],[],[],[],[],[],[],[],[],[],[],[],...},
                    {{[],[],[],[],[],[],[],[],[],[],[],...}}},
              undefined}}
([email protected])45> dict:to_list(riak_object:get_metadata(ObjStored)).
[{<<"Links">>,[]},
 {<<"X-Riak-VTag">>,"3TQzJznzXXWtZefntWXPDR"},
 {<<"content-type">>,"text/plain"},
 {<<"X-Riak-Last-Modified">>,{1306,338030,682871}},
 {<<"X-Riak-Meta">>,[]}]

([email protected])46> iolist_size(erlang:term_to_binary(ObjStored)).
449


Which leads to (remember again to subtract 2 bytes for bucket and key):

( 14 + ( 13 + Bucket + Key ) + ( 447 + Bucket + Key + Value ) ) * Num Entries * N_Val

( 14 + ( 13 + 10 + 36) + ( 447 + 10 + 36 ) ) * 183915891 * 3 = 290.8 GB


Nearly there!

Now there are also the hintfiles, which are a kind of index into the bitcask data files to speed up the start of a riak node. The hintfiles contain one entry per key, and the code that creates one entry looks like this:

    [<<Tstamp:?TSTAMPFIELD>>,<<KeySz:?KEYSIZEFIELD>>,
     <<TotalSz:?TOTALSIZEFIELD>>,<<Offset:?OFFSETFIELD>>, Key].


So that's 4 + 2 + 4 + 8 + KeySize (= 18 + KeySize) additional bytes per key, where KeySize is again the size of the serialized bucket/key pair.
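
As a tiny helper of my own:

%% Per-key hintfile cost: 18 fixed bytes plus the serialized bucket/key.
HintBytes = fun(BucketSz, KeySz) -> 18 + (13 + BucketSz + KeySz) end.
HintBytes(10, 36).  %% => 77 bytes per key in your case
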
So the final result, if you inserted the keys via the REST API, is:

( 14 + ( 13 + Bucket + Key ) + ( 447 + Bucket + Key + Value ) + ( 18 + ( 13 + Bucket + Key ) ) ) * Num Entries * N_Val
  = ( 505 + 3 * ( Bucket + Key ) + Value ) * Num Entries * N_Val

( 505 + 3 * (10 + 36) + 36 ) * 183915891 * 3 = 374636669967 = 348.9 GB


And if you used Erlang (or probably any Protocol Buffers client):

( 14 + ( 13 + Bucket + Key ) + ( 356 + Bucket + Key + Value ) + ( 18 + ( 13 + Bucket + Key ) ) ) * Num Entries * N_Val
  = ( 414 + 3 * ( Bucket + Key ) + Value ) * Num Entries * N_Val

( 414 + 3 * (10 + 36) + 36 ) * 183915891 * 3 = 324427631724 = 302.1 GB
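
Putting all the pieces together in one sketch of mine (447 and 356 are the REST and Erlang/PB object overheads derived above; this is an estimate, not an official capacity formula):

%% Total predicted bitcask disk usage: data file record plus hintfile
%% entry per key, times entries and replicas. All sizes in bytes.
DiskBytes = fun(ObjOverhead, BucketSz, KeySz, ValueSz, NumEntries, NVal) ->
                Record = 14 + (13 + BucketSz + KeySz)
                            + (ObjOverhead + BucketSz + KeySz + ValueSz),
                Hint   = 18 + (13 + BucketSz + KeySz),
                (Record + Hint) * NumEntries * NVal
            end.
DiskBytes(447, 10, 36, 36, 183915891, 3).  %% => 374636669967, ~348.9 GB (REST)
DiskBytes(356, 10, 36, 36, 183915891, 3).  %% => 324427631724, ~302.1 GB (PB)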


So the truth is somewhere in between. But as David wrote, there can be additional overhead due to the append-only nature of bitcask.

Cheers,
Nico

On 24.05.2011 23:48, Anthony Molinaro wrote:
Just curious if anyone has any ideas, for the moment, I'm just taking
the RAM calculation and multiplying by 2 and the Disk calculation and
multiplying by 8, based on my findings with my current cluster.  But
I would like to know why my values are so much higher than those I should
be getting.

Also, I'd still like to know how the forms calculate things as the disk
calculation there does not match reality or the formula.

Also, waiting to hear if there is any way to force merge to run so I can
more accurately gauge whether multiple copies are affecting disk usage.

Thanks,

-Anthony

On Mon, May 23, 2011 at 11:06:31PM -0700, Anthony Molinaro wrote:
On Mon, May 23, 2011 at 10:53:29PM -0700, Anthony Molinaro wrote:
On Mon, May 23, 2011 at 09:57:25PM -0600, David Smith wrote:
Thus, depending on
your merge triggers, more space can be used than is strictly necessary
to store the data.
So the lack of any overhead in the calculation is expected?  I mean
according to http://wiki.basho.com/Cluster-Capacity-Planning.html

Disk = Estimated Total Objects * Average Object Size * n_val

Which just seems wrong, doesn't it?  I don't quite understand the
bitcask code well enough yet to see what the actual data it stores is,
but the whitepaper suggested several things were involved in the on
disk representation.
Okay, finally found the code for this part; I kept looking in the nif, but that's only the keydir, not the data files. It looks like
but that's only the keydir, not the data files.  It looks like

    %% Setup io_list for writing -- avoid merging binaries if we can help it
    Bytes0 = [<<Tstamp:?TSTAMPFIELD>>,<<KeySz:?KEYSIZEFIELD>>,
              <<ValueSz:?VALSIZEFIELD>>, Key, Value],
    Bytes  = [<<(erlang:crc32(Bytes0)):?CRCSIZEFIELD>>  | Bytes0],

And looking at the header, it seems that there's 14 bytes of overhead
(4 for CRC, 4 for timestamp, 2 for keysize, 4 for valsize).

So disk calculation should be

( 14 + Key + Value ) * Num Entries * N_Val

So using my numbers from before that gives

( 14 + 36 + 36 ) * 183915891 * 3 = 47450299878 = 44.1 GB

which actually isn't much closer to 341 GB than the previous calculation :(

So all my questions from the previous email still apply.

-Anthony

--
------------------------------------------------------------------------
Anthony Molinaro<[email protected]>
