Hi Anthony,
I think I can explain at least a big chunk of the difference in RAM and
disk consumption you see.
Let's start with RAM. I could of course be wrong here, but I believe the
'static bitcask per-key overhead' is simply too small. Let me explain
why.
The bitcask_keydir_entry struct for each entry looks like this:
typedef struct
{
    uint32_t file_id;   /* id of the data file holding the newest value */
    uint32_t total_sz;  /* total size of the record in that file */
    uint64_t offset;    /* offset of the record within the file */
    uint32_t tstamp;    /* timestamp of the newest write */
    uint16_t key_sz;    /* length of the key in bytes */
    char     key[0];    /* key bytes, stored directly after the struct */
} bitcask_keydir_entry;
This indeed has a size of 22 bytes (the array 'key' has zero entries
because the key bytes are written to the memory directly after the
keydir entry).
As is done in the capacity planner, you need to add the size of the
bucket and key to get the size of the keydir entry, but that is not the
whole story.
The thing that is actually stored in key is the result of this Erlang
expression:
erlang:term_to_binary( {<<"bucket">>,<<"key">>} )
that is, a tuple of two binaries converted to the Erlang external term
format.
So let's see:
1> term_to_binary({<<>>,<<>>}).
<<131,104,2,109,0,0,0,0,109,0,0,0,0>>
2> iolist_size(term_to_binary({<<>>,<<>>})).
13
3> iolist_size(term_to_binary({<<"a">>,<<"b">>})).
15
4> iolist_size(term_to_binary({<<"aa">>,<<"b">>})).
16
5> iolist_size(term_to_binary({<<"aa">>,<<"bb">>})).
17
So even an empty bucket/key pair takes 13 bytes to store: 1 byte for the
version tag, 2 bytes for the tuple header, and 5 bytes per binary (a
1-byte tag plus a 4-byte length).
Also, since the hashtable storing the keydir entries is essentially an
array of pointers to bitcask_keydir_entry objects, there are another 8
bytes of overhead per key, assuming you are running a 64-bit system.
So the real static overhead per key is not 22 but 22 + 13 + 8 = 43 bytes.
Let's run the numbers for your predicted memory consumption again:
( 43 + 10 + 36 ) * 183915891 * 3 = 49105542897 = 45.7 GB
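If you want to re-run that estimate yourself, here is a minimal Erlang
sketch (the function name is mine; the 43-byte constant is the one
derived above, and the sizes and entry count are the ones from your mail):

%% Predicted keydir RAM: 43 bytes of static per-key overhead plus the
%% serialized bucket and key bytes, times entries, times n_val.
keydir_ram_bytes(BucketSz, KeySz, NumEntries, NVal) ->
    (43 + BucketSz + KeySz) * NumEntries * NVal.

%% keydir_ram_bytes(10, 36, 183915891, 3) =:= 49105542897 (~45.7 GB)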
Your actual RAM consumption of 70 GB seems to be at odds with the output
of erlang:memory/0 that you sent:
{total,7281790968} => RAM: 7281790968 * 8 = 58254327744 = 54.3 GB
So that is much closer, within about 20 percent. Some additional
overhead is to be expected, but it is hard to say how much of that is
due to Erlang's internal usage and how much is due to bitcask.
So let's examine the disk consumption next.
As you rightly concluded, the equation at
http://wiki.basho.com/Cluster-Capacity-Planning.html is somewhat
simplified, and you are also right that the real equation would be
( 14 + Key + Value ) * Num Entries * N_Val
On the other hand, 14 bytes plus the key size might be quite irrelevant
if your values have a size of at least 2 KB (as in the example; 50 bytes
of overhead on a 2 KB value is about 2.5 percent), which seems to be the
general assumption in some aspects of the design of riak and bitcask.
As you also noticed, this additional small overhead brings you nowhere
near the disk usage that you observe.
First, the key that is stored in the bitcask files is not the key part
of the bucket/key pair that riak calls a key, but the serialized
bucket/key pair described above, so the calculation becomes:
( 14 + ( 13 + Bucket + Key) + Value ) * Num Entries * N_Val
( 14 + ( 13 + 10 + 36) + 36 ) * 183915891 * 3 = 56 GB
Still not enough :-/.
So next let's examine what is actually stored as the value in bitcask.
It is not simply the data you provide, but a riak object (an r_object
record), which is again serialized by the erlang:term_to_binary/1
function. So let's see. I create a new riak object with a zero-byte
bucket, key and value:
3> Obj = riak_object:new(<<>>,<<>>,<<>>).
{r_object,<<>>,<<>>,
[{r_content,{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],[],...}}},
<<>>}],
[],
{dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
undefined}
4> iolist_size(erlang:term_to_binary(Obj)).
205
Also, bucket and key are contained in the riak object itself (and
therefore in the bitcask notion of the value). So with this information
the predicted disk usage becomes:
( 14 + ( 13 + Bucket + Key ) + ( 205 + Bucket + Key + Value ) ) * Num Entries *
N_Val
( 14 + ( 13 + 10 + 36 ) + ( 205 + 10 + 36 + 36 ) ) * 183915891 * 3 = 198629162280 = 185.0 GB
which is way closer to the 341 GB you observe.
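Expressed as a hedged Erlang sketch (the function and parameter names
are mine; ObjOverhead is the per-object serialization overhead we are
still pinning down, 205 bytes so far):

%% Per record: 14 bytes of bitcask framing, the serialized bucket/key
%% pair (13 bytes plus the bytes themselves), and the serialized riak
%% object (ObjOverhead plus bucket, key and value bytes).
disk_bytes(ObjOverhead, BucketSz, KeySz, ValueSz, NumEntries, NVal) ->
    Record = 14 + (13 + BucketSz + KeySz)
                + (ObjOverhead + BucketSz + KeySz + ValueSz),
    Record * NumEntries * NVal.

%% disk_bytes(205, 10, 36, 36, 183915891, 3) =:= 198629162280 (~185.0 GB)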
But we can get even closer, although the details become somewhat more
fuzzy. Bear with me.
I again create a riak object, but this time with a non-empty bucket/key
pair so I can store it in riak:
([email protected])7> Obj = riak_object:new(<<"a">>,<<"a">>,<<>>).
{r_object,<<"a">>,<<"a">>,
[{r_content,{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],[],...}}},
<<>>}],
[],
{dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
undefined}
([email protected])8> iolist_size(erlang:term_to_binary(Obj)).
207
([email protected])9> {ok,C}=riak:local_client().
{ok,{riak_client,'[email protected]',<<2,123,179,255>>}}
([email protected])10> C:put(Obj,1,1).
ok
([email protected])12> {ok,ObjStored} = C:get(<<"a">>,<<"a">>, 1).
{ok,{r_object,<<"a">>,<<"a">>,
[{r_content,{dict,2,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],...}}},
<<>>}],
[{<<2,123,179,255>>,{1,63473554112}}],
{dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],...}}},
undefined}}
([email protected])13> iolist_size(erlang:term_to_binary(ObjStored)).
358
OK, what happened? The object we retrieved is considerably larger than
the one we stored. One culprit is the vector clock data, which was an
empty list for Obj and now has one entry:
([email protected])14> riak_object:vclock(Obj).
[]
([email protected])15> riak_object:vclock(ObjStored).
[{<<2,123,179,255>>,{1,63473554112}}]
([email protected])23> iolist_size(term_to_binary(riak_object:vclock(Obj))).
2
([email protected])24> iolist_size(term_to_binary(riak_object:vclock(ObjStored))).
30
So that's 28 bytes each time the object is updated with a new client ID
(so always use a meaningful client ID!), until vclock pruning sets in.
The default bucket property is {big_vclock,50}, so in the worst case
this could account for 28 * 50 = 1400 bytes!
And every object that has been stored has at least one entry in the
vclock, so that is another 28 bytes of overhead.
The other part of the growth stems from some standard entries, which are
added to the object metadata during the put operation:
([email protected])35> dict:to_list(riak_object:get_metadata(Obj)).
[]
([email protected])37> iolist_size(term_to_binary(riak_object:get_metadata(Obj))).
60
([email protected])36> dict:to_list(riak_object:get_metadata(ObjStored)).
[{<<"X-Riak-VTag">>,"7PoD9FEMUBzNmQeMnjUbas"},
{<<"X-Riak-Last-Modified">>,{1306,334912,424099}}]
([email protected])38> iolist_size(term_to_binary(riak_object:get_metadata(ObjStored))).
183
So there are the other 123 bytes.
In total, this 356-byte overhead per object (the other 2 bytes of the
358 measured above came from the bucket and key, which are already
accounted for) leads us to the following calculation:
( 14 + ( 13 + Bucket + Key ) + ( 356 + Bucket + Key + Value ) ) * Num Entries * N_Val
( 14 + ( 13 + 10 + 36 ) + ( 356 + 10 + 36 + 36 ) ) * 183915891 * 3 = 281943060903 = 262.6 GB
We are getting closer!
If you loaded the data via the REST API, the overhead is somewhat larger
still, since the object will also contain 'content-type', 'X-Riak-Meta'
and 'Links' metadata entries:
xxxx@node2:~$ curl -v -d '' -H "Content-Type: text/plain"
http://127.0.0.1:8098/riak/a/a
([email protected])44> {ok,ObjStored} = C:get(<<"a">>,<<"a">>, 1).
{ok,{r_object,<<"a">>,<<"a">>,
[{r_content,{dict,5,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[[<<"Links">>]],[],[],[],[],[],[],[],...}}},
<<>>}],
[{<<5,134,53,93>>,{1,63473557230}}],
{dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],...}}},
undefined}}
([email protected])45> dict:to_list(riak_object:get_metadata(ObjStored)).
[{<<"Links">>,[]},
{<<"X-Riak-VTag">>,"3TQzJznzXXWtZefntWXPDR"},
{<<"content-type">>,"text/plain"},
{<<"X-Riak-Last-Modified">>,{1306,338030,682871}},
{<<"X-Riak-Meta">>,[]}]
([email protected])46> iolist_size(erlang:term_to_binary(ObjStored)).
449
Which leads to (remember again to subtract the 2 bytes for bucket and key):
( 14 + ( 13 + Bucket + Key ) + ( 447 + Bucket + Key + Value ) ) * Num Entries * N_Val
( 14 + ( 13 + 10 + 36 ) + ( 447 + 10 + 36 + 36 ) ) * 183915891 * 3 = 332152099146 = 309.3 GB
Nearly there!
Now there are also the hintfiles, which are a kind of index into the
bitcask data files to speed up the start of a riak node. The hintfiles
contain one entry per key, and the code that creates one entry looks
like this:
[<<Tstamp:?TSTAMPFIELD>>,<<KeySz:?KEYSIZEFIELD>>,
 <<TotalSz:?TOTALSIZEFIELD>>,<<Offset:?OFFSETFIELD>>, Key].
So that's 4 + 2 + 4 + 8 + KeySize (= 18 + KeySize) additional bytes per key.
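Spelled out with those field widths (4, 2, 4 and 8 bytes), a small
sketch of the hintfile entry size:

%% Hintfile entry: timestamp, key size, total record size, offset, key
%% (the zeros are placeholders; only the field sizes matter here).
hint_entry_bytes(Key) when is_binary(Key) ->
    iolist_size([<<0:32>>, <<(byte_size(Key)):16>>,
                 <<0:32>>, <<0:64>>, Key]).

%% hint_entry_bytes(term_to_binary({<<"bucket">>,<<"key">>})) yields 18
%% bytes plus the serialized bucket/key pair, as used below.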
So the final result, if you inserted the keys via the REST API, is:
( 14 + ( 13 + Bucket + Key ) + ( 447 + Bucket + Key + Value ) + ( 18 + ( 13 + Bucket + Key ) ) ) * Num Entries * N_Val
= ( 505 + 3 * (Bucket + Key) + Value ) * Num Entries * N_Val
( 505 + 3 * (10 + 36) + 36 ) * 183915891 * 3 = 374636669967 = 348.9 GB
And if you used Erlang (or probably any Protocol Buffers client):
( 14 + ( 13 + Bucket + Key ) + ( 356 + Bucket + Key + Value ) + ( 18 + ( 13 + Bucket + Key ) ) ) * Num Entries * N_Val
= ( 414 + 3 * (Bucket + Key) + Value ) * Num Entries * N_Val
( 414 + 3 * (10 + 36) + 36 ) * 183915891 * 3 = 324427631724 = 302.1 GB
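To wrap both cases up in one runnable sketch (function and argument
names are mine; ObjOverhead is 447 for the REST API and 356 for the
Erlang or Protocol Buffers clients):

%% Full disk estimate: data file record plus hintfile entry, per key.
total_disk_bytes(ObjOverhead, BucketSz, KeySz, ValueSz, NumEntries, NVal) ->
    BK     = 13 + BucketSz + KeySz,  % serialized bucket/key pair
    Record = 14 + BK + (ObjOverhead + BucketSz + KeySz + ValueSz),
    Hint   = 18 + BK,
    (Record + Hint) * NumEntries * NVal.

%% total_disk_bytes(447, 10, 36, 36, 183915891, 3) =:= 374636669967 (~348.9 GB)
%% total_disk_bytes(356, 10, 36, 36, 183915891, 3) =:= 324427631724 (~302.1 GB)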
So the truth is somewhere in between. But as David wrote, there can be
additional overhead due to the append-only nature of bitcask.
Cheers,
Nico
On 24.05.2011 23:48, Anthony Molinaro wrote:
Just curious if anyone has any ideas. For the moment, I'm just taking
the RAM calculation and multiplying by 2, and the disk calculation and
multiplying by 8, based on my findings with my current cluster. But I
would like to know why my values are so much higher than those I should
be getting.
Also, I'd still like to know how the forms calculate things, as the disk
calculation there does not match reality or the formula.
Also, I am waiting to hear if there is any way to force a merge to run
so I can more accurately gauge whether multiple copies are affecting
disk usage.
Thanks,
-Anthony
On Mon, May 23, 2011 at 11:06:31PM -0700, Anthony Molinaro wrote:
On Mon, May 23, 2011 at 10:53:29PM -0700, Anthony Molinaro wrote:
On Mon, May 23, 2011 at 09:57:25PM -0600, David Smith wrote:
On Mon, May 23, 2011 at 9:39 PM, Anthony Molinaro
Thus, depending on
your merge triggers, more space can be used than is strictly necessary
to store the data.
So the lack of any overhead in the calculation is expected? I mean
according to http://wiki.basho.com/Cluster-Capacity-Planning.html
Disk = Estimated Total Objects * Average Object Size * n_val
Which just seems wrong, doesn't it? I don't quite understand the bitcask
code well enough yet to see what data it actually stores, but the
whitepaper suggested several things were involved in the on-disk
representation.
Okay, I finally found the code for this part; I kept looking in the NIF,
but that's only the keydir, not the data files. It looks like:
%% Setup io_list for writing -- avoid merging binaries if we can help it
Bytes0 = [<<Tstamp:?TSTAMPFIELD>>,<<KeySz:?KEYSIZEFIELD>>,
<<ValueSz:?VALSIZEFIELD>>, Key, Value],
Bytes = [<<(erlang:crc32(Bytes0)):?CRCSIZEFIELD>> | Bytes0],
And looking at the header, it seems that there's 14 bytes of overhead
(4 for CRC, 4 for timestamp, 2 for keysize, 4 for valsize).
So the disk calculation should be
( 14 + Key + Value ) * Num Entries * N_Val
So using my numbers from before that gives
( 14 + 36 + 36 ) * 183915891 * 3 = 47450299878 = 44.1 GB
which actually isn't much closer to 341 GB than the previous calculation :(
So all my questions from the previous email still apply.
-Anthony
--
------------------------------------------------------------------------
Anthony Molinaro<[email protected]>