Hi Anthony,
I think I can explain at least a big chunk of the difference in RAM and
disk consumption you see.
Let's start with RAM. I could of course be wrong here, but I believe the
'static bitcask per-key overhead' is simply too small. Let me explain
why.
The bitcask_keydir_entry struct for each entry looks like this:
typedef struct
{
    uint32_t file_id;   /* id of the data file holding the newest value */
    uint32_t total_sz;  /* total size of the record in that file */
    uint64_t offset;    /* offset of the record within the file */
    uint32_t tstamp;    /* timestamp of the newest write */
    uint16_t key_sz;    /* length of the key in bytes */
    char     key[0];    /* key bytes, stored directly after the struct */
} bitcask_keydir_entry;
This indeed has a size of 22 bytes (the array 'key' has zero entries
because the key bytes are written to the memory directly after the
keydir entry).
As is done in the capacity planner, you need to add the size of the
bucket and key to get the size of the keydir entry, but that is not the
whole story.
The thing that is actually stored in key is the result of this Erlang
expression:
erlang:term_to_binary( {<<"bucket">>,<<"key">>} )
that is, a tuple of two binaries converted to the Erlang external term
format.
So let's see:
1> term_to_binary({<<>>,<<>>}).
<<131,104,2,109,0,0,0,0,109,0,0,0,0>>
2> iolist_size(term_to_binary({<<>>,<<>>})).
13
3> iolist_size(term_to_binary({<<"a">>,<<"b">>})).
15
4> iolist_size(term_to_binary({<<"aa">>,<<"b">>})).
16
5> iolist_size(term_to_binary({<<"aa">>,<<"bb">>})).
17
So even an empty bucket/key pair takes 13 bytes to store: 1 byte for the
version tag, 2 bytes for the tuple header, and 5 bytes per binary (a
1-byte tag plus a 4-byte length).
Also, since the hashtable storing the keydir entries is essentially an
array of pointers to bitcask_keydir_entry objects, there are another 8
bytes of overhead per key, assuming you are running a 64-bit system.
So the real static overhead per key is not 22 but 22 + 13 + 8 = 43 bytes.
Let's run the numbers for your predicted memory consumption again:
( 43 + 10 + 36 ) * 183915891 * 3 = 49105542897 = 45.7 GB
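If you want to re-run that estimate yourself, here is a minimal Erlang
sketch (the function name is mine; the 43-byte constant is the one
derived above, and the sizes and entry count are the ones from your mail):

%% Predicted keydir RAM: 43 bytes of static per-key overhead plus the
%% serialized bucket and key bytes, times entries, times n_val.
keydir_ram_bytes(BucketSz, KeySz, NumEntries, NVal) ->
    (43 + BucketSz + KeySz) * NumEntries * NVal.

%% keydir_ram_bytes(10, 36, 183915891, 3) =:= 49105542897 (~45.7 GB)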
Your actual RAM consumption of 70 GB seems to be at odds with the output
of erlang:memory/0 that you sent:
{total,7281790968} => RAM: 7281790968 * 8 = 58254327744 = 54.3 GB
So that is much closer, within about 20 percent. Some additional
overhead is to be expected, but it is hard to say how much of that is
due to Erlang's internal usage and how much is due to bitcask.
So let's examine the disk consumption next.
As you rightly concluded, the equation at
http://wiki.basho.com/Cluster-Capacity-Planning.html is somewhat
simplified, and you are also right that the real equation would be
( 14 + Key + Value ) * Num Entries * N_Val
On the other hand, 14 bytes plus the key size might be quite irrelevant
if your values have a size of at least 2 KB (as in the example; 50 bytes
of overhead on a 2 KB value is about 2.5 percent), which seems to be the
general assumption in some aspects of the design of riak and bitcask.
As you also noticed, this additional small overhead brings you nowhere
near the disk usage that you observe.
First, the key that is stored in the bitcask files is not the key part
of the bucket/key pair that riak calls a key, but the serialized
bucket/key pair described above, so the calculation becomes:
( 14 + ( 13 + Bucket + Key) + Value ) * Num Entries * N_Val
( 14 + ( 13 + 10 + 36) + 36 ) * 183915891 * 3 = 56 GB
Still not enough :-/.
So next let's examine what is actually stored as the value in bitcask.
It is not simply the data you provide, but a riak object (an r_object
record), which is again serialized by the erlang:term_to_binary/1
function. So let's see. I create a new riak object with a zero-byte
bucket, key and value:
3> Obj = riak_object:new(<<>>,<<>>,<<>>).
{r_object,<<>>,<<>>,
[{r_content,{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],[],...}}},
<<>>}],
[],
{dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
undefined}
4> iolist_size(erlang:term_to_binary(Obj)).
205
Also, bucket and key are contained in the riak object itself (and
therefore in the bitcask notion of the value). So with this information
the predicted disk usage becomes:
( 14 + ( 13 + Bucket + Key ) + ( 205 + Bucket + Key + Value ) ) * Num Entries *
N_Val
( 14 + ( 13 + 10 + 36 ) + ( 205 + 10 + 36 + 36 ) ) * 183915891 * 3 = 198629162280 = 185.0 GB
which is way closer to the 341 GB you observe.
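Expressed as a hedged Erlang sketch (the function and parameter names
are mine; ObjOverhead is the per-object serialization overhead we are
still pinning down, 205 bytes so far):

%% Per record: 14 bytes of bitcask framing, the serialized bucket/key
%% pair (13 bytes plus the bytes themselves), and the serialized riak
%% object (ObjOverhead plus bucket, key and value bytes).
disk_bytes(ObjOverhead, BucketSz, KeySz, ValueSz, NumEntries, NVal) ->
    Record = 14 + (13 + BucketSz + KeySz)
                + (ObjOverhead + BucketSz + KeySz + ValueSz),
    Record * NumEntries * NVal.

%% disk_bytes(205, 10, 36, 36, 183915891, 3) =:= 198629162280 (~185.0 GB)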
But we can get even closer, although the details become somewhat more
fuzzy. Bear with me.
I again create a riak object, but this time with a non-empty bucket/key
pair so I can store it in riak:
([email protected])7> Obj = riak_object:new(<<"a">>,<<"a">>,<<>>).
{r_object,<<"a">>,<<"a">>,
[{r_content,{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],[],...}}},
<<>>}],
[],
{dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
undefined}
([email protected])8> iolist_size(erlang:term_to_binary(Obj)).
207
([email protected])9> {ok,C}=riak:local_client().
{ok,{riak_client,'[email protected]',<<2,123,179,255>>}}
([email protected])10> C:put(Obj,1,1).
ok
([email protected])12> {ok,ObjStored} = C:get(<<"a">>,<<"a">>, 1).
{ok,{r_object,<<"a">>,<<"a">>,
[{r_content,{dict,2,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],...}}},
<<>>}],
[{<<2,123,179,255>>,{1,63473554112}}],
{dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],...}}},
undefined}}
([email protected])13> iolist_size(erlang:term_to_binary(ObjStored)).
358
OK, what happened? The object we retrieved is considerably larger than
the one we stored. One culprit is the vector clock data, which was an
empty list for Obj and now has one entry:
([email protected])14> riak_object:vclock(Obj).
[]
([email protected])15> riak_object:vclock(ObjStored).
[{<<2,123,179,255>>,{1,63473554112}}]
([email protected])23> iolist_size(term_to_binary(riak_object:vclock(Obj))).
2
([email protected])24> iolist_size(term_to_binary(riak_object:vclock(ObjStored))).
30
So that's 28 bytes each time the object is updated with a new client ID
(so always use a meaningful client ID!), until vclock pruning sets in.
The default bucket property is {big_vclock,50}, so in the worst case
this could account for 28 * 50 = 1400 bytes!
And every object that has been stored has at least one entry in the
vclock, so that is another 28 bytes of overhead.
The other part of the growth stems from some standard entries, which are
added to the object metadata during the put operation:
([email protected])35> dict:to_list(riak_object:get_metadata(Obj)).
[]
([email protected])37> iolist_size(term_to_binary(riak_object:get_metadata(Obj))).
60
([email protected])36> dict:to_list(riak_object:get_metadata(ObjStored)).
[{<<"X-Riak-VTag">>,"7PoD9FEMUBzNmQeMnjUbas"},
{<<"X-Riak-Last-Modified">>,{1306,334912,424099}}]
([email protected])38> iolist_size(term_to_binary(riak_object:get_metadata(ObjStored))).
183
So there are the other 123 bytes.
In total, this 356-byte overhead per object (the other 2 bytes of the
358 measured above came from the bucket and key, which are already
accounted for) leads us to the following calculation:
( 14 + ( 13 + Bucket + Key ) + ( 356 + Bucket + Key + Value ) ) * Num Entries * N_Val
( 14 + ( 13 + 10 + 36 ) + ( 356 + 10 + 36 + 36 ) ) * 183915891 * 3 = 281943060903 = 262.6 GB
We are getting closer!
If you loaded the data via the REST API, the overhead is somewhat larger
still, since the object will also contain 'content-type', 'X-Riak-Meta'
and 'Links' metadata entries:
xxxx@node2:~$ curl -v -d '' -H "Content-Type: text/plain"
http://127.0.0.1:8098/riak/a/a
([email protected])44> {ok,ObjStored} = C:get(<<"a">>,<<"a">>, 1).
{ok,{r_object,<<"a">>,<<"a">>,
[{r_content,{dict,5,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[[<<"Links">>]],[],[],[],[],[],[],[],...}}},
<<>>}],
[{<<5,134,53,93>>,{1,63473557230}}],
{dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],...}}},
undefined}}
([email protected])45> dict:to_list(riak_object:get_metadata(ObjStored)).
[{<<"Links">>,[]},
{<<"X-Riak-VTag">>,"3TQzJznzXXWtZefntWXPDR"},
{<<"content-type">>,"text/plain"},
{<<"X-Riak-Last-Modified">>,{1306,338030,682871}},
{<<"X-Riak-Meta">>,[]}]
([email protected])46> iolist_size(erlang:term_to_binary(ObjStored)).
449
Which leads to (remember again to subtract the 2 bytes for bucket and key):
( 14 + ( 13 + Bucket + Key ) + ( 447 + Bucket + Key + Value ) ) * Num Entries * N_Val
( 14 + ( 13 + 10 + 36 ) + ( 447 + 10 + 36 + 36 ) ) * 183915891 * 3 = 332152099146 = 309.3 GB
Nearly there!
Now there are also the hintfiles, which are a kind of index into the
bitcask data files to speed up the start of a riak node. The hintfiles
contain one entry per key, and the code that creates one entry looks
like this:
[<<Tstamp:?TSTAMPFIELD>>,<<KeySz:?KEYSIZEFIELD>>,
 <<TotalSz:?TOTALSIZEFIELD>>,<<Offset:?OFFSETFIELD>>, Key].
So that's 4 + 2 + 4 + 8 + KeySize (= 18 + KeySize) additional bytes per key.
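Spelled out with those field widths (4, 2, 4 and 8 bytes), a small
sketch of the hintfile entry size:

%% Hintfile entry: timestamp, key size, total record size, offset, key
%% (the zeros are placeholders; only the field sizes matter here).
hint_entry_bytes(Key) when is_binary(Key) ->
    iolist_size([<<0:32>>, <<(byte_size(Key)):16>>,
                 <<0:32>>, <<0:64>>, Key]).

%% hint_entry_bytes(term_to_binary({<<"bucket">>,<<"key">>})) yields 18
%% bytes plus the serialized bucket/key pair, as used below.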
So the final result, if you inserted the keys via the REST API, is:
( 14 + ( 13 + Bucket + Key ) + ( 447 + Bucket + Key + Value ) + ( 18 + ( 13 + Bucket + Key ) ) ) * Num Entries * N_Val
= ( 505 + 3 * (Bucket + Key) + Value ) * Num Entries * N_Val
( 505 + 3 * (10 + 36) + 36 ) * 183915891 * 3 = 374636669967 = 348.9 GB
And if you used Erlang (or probably any Protocol Buffers client):
( 14 + ( 13 + Bucket + Key ) + ( 356 + Bucket + Key + Value ) + ( 18 + ( 13 + Bucket + Key ) ) ) * Num Entries * N_Val
= ( 414 + 3 * (Bucket + Key) + Value ) * Num Entries * N_Val
( 414 + 3 * (10 + 36) + 36 ) * 183915891 * 3 = 324427631724 = 302.1 GB
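To wrap both cases up in one runnable sketch (function and argument
names are mine; ObjOverhead is 447 for the REST API and 356 for the
Erlang or Protocol Buffers clients):

%% Full disk estimate: data file record plus hintfile entry, per key.
total_disk_bytes(ObjOverhead, BucketSz, KeySz, ValueSz, NumEntries, NVal) ->
    BK     = 13 + BucketSz + KeySz,  % serialized bucket/key pair
    Record = 14 + BK + (ObjOverhead + BucketSz + KeySz + ValueSz),
    Hint   = 18 + BK,
    (Record + Hint) * NumEntries * NVal.

%% total_disk_bytes(447, 10, 36, 36, 183915891, 3) =:= 374636669967 (~348.9 GB)
%% total_disk_bytes(356, 10, 36, 36, 183915891, 3) =:= 324427631724 (~302.1 GB)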
So the truth is somewhere in between. But as David wrote, there can be
additional overhead due to the append-only nature of bitcask.
Cheers,
Nico
On 24.05.2011 23:48, Anthony Molinaro wrote:
Just curious if anyone has any ideas. For the moment, I'm just taking
the RAM calculation and multiplying by 2, and the disk calculation and
multiplying by 8, based on my findings with my current cluster. But I
would like to know why my values are so much higher than those I should
be getting.
Also, I'd still like to know how the forms calculate things, as the disk
calculation there does not match reality or the formula.
Also, I am waiting to hear if there is any way to force a merge to run
so I can more accurately gauge whether multiple copies are affecting
disk usage.
Thanks,
-Anthony
On Mon, May 23, 2011 at 11:06:31PM -0700, Anthony Molinaro wrote:
On Mon, May 23, 2011 at 10:53:29PM -0700, Anthony Molinaro wrote:
On Mon, May 23, 2011 at 09:57:25PM -0600, David Smith wrote:
On Mon, May 23, 2011 at 9:39 PM, Anthony Molinaro
Thus, depending on
your merge triggers, more space can be used than is strictly necessary
to store the data.
So the lack of any overhead in the calculation is expected? I mean
according to http://wiki.basho.com/Cluster-Capacity-Planning.html
Disk = Estimated Total Objects * Average Object Size * n_val
Which just seems wrong, doesn't it? I don't quite understand the bitcask
code well enough yet to see what data it actually stores, but the
whitepaper suggested several things were involved in the on-disk
representation.
Okay, I finally found the code for this part; I kept looking in the NIF,
but that's only the keydir, not the data files. It looks like:
%% Setup io_list for writing -- avoid merging binaries if we can help it
Bytes0 = [<<Tstamp:?TSTAMPFIELD>>,<<KeySz:?KEYSIZEFIELD>>,
<<ValueSz:?VALSIZEFIELD>>, Key, Value],
Bytes = [<<(erlang:crc32(Bytes0)):?CRCSIZEFIELD>> | Bytes0],
And looking at the header, it seems that there's 14 bytes of overhead
(4 for CRC, 4 for timestamp, 2 for keysize, 4 for valsize).
So the disk calculation should be
( 14 + Key + Value ) * Num Entries * N_Val
So using my numbers from before that gives
( 14 + 36 + 36 ) * 183915891 * 3 = 47450299878 = 44.1 GB
which actually isn't much closer to 341 GB than the previous calculation :(
So all my questions from the previous email still apply.
-Anthony
--
------------------------------------------------------------------------
Anthony Molinaro<[email protected]>