Hi Steve,

Those files are not being merged for the reason Justin gave earlier: expiry
is not a condition that affects the merge check.

Stated earlier:

That is, bitcask wasn't originally designed around the expiry-centric
way of removing old data, and data that has simply expired (but not
actively been deleted) will not be counted as garbage toward
thresholds or triggers at this time.  It will be cleaned up in a
merge, but will not contribute toward causing the merge in the first
place.  In a use case where you only add items and never actually
delete anything, a merge will never be dynamically triggered.

It is plausible that we could add some expiry-statistics measurement
and triggering to bitcask, but today that's the state of things.  You
could manually trigger merges, but that currently requires a bit of
Erlang.
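
To make that concrete, here is a rough Python model of the merge check as
described above. It is only an illustration and not bitcask's actual code:
the DataFile class, its field names, and the strict greater-than comparisons
are assumptions, and only the two dead-bytes settings from your config are
modelled. The byte counts in the example are made up.

```python
from dataclasses import dataclass
from typing import List

# Values copied from the app.config quoted in this thread.
DEAD_BYTES_MERGE_TRIGGER = 10242880
DEAD_BYTES_THRESHOLD = 5242880


@dataclass
class DataFile:
    """Hypothetical per-file statistics; not bitcask's real data structure."""
    name: str
    dead_bytes: int        # bytes superseded by overwrites or explicit deletes
    expired_bytes: int     # bytes past expiry_secs that were never deleted
    active: bool = False   # the file currently being written to


def merge_triggered(files: List[DataFile]) -> bool:
    """A merge starts only if some non-active file exceeds the trigger.

    expired_bytes is deliberately ignored: expired-but-undeleted data
    does not count as garbage toward the trigger.
    """
    return any(
        f.dead_bytes > DEAD_BYTES_MERGE_TRIGGER
        for f in files
        if not f.active
    )


def files_to_merge(files: List[DataFile]) -> List[str]:
    """Once triggered, every non-active file over the threshold is merged."""
    if not merge_triggered(files):
        return []
    return [
        f.name
        for f in files
        if not f.active and f.dead_bytes > DEAD_BYTES_THRESHOLD
    ]


if __name__ == "__main__":
    # Append-only workload: lots of expired data, nothing deleted.
    append_only = [
        DataFile("old.bitcask.data", dead_bytes=0, expired_bytes=120_000_000),
        DataFile("new.bitcask.data", dead_bytes=0, expired_bytes=0, active=True),
    ]
    print(merge_triggered(append_only))
    print(files_to_merge(append_only))
```

With an append-only workload like yours, dead_bytes stays at zero in every
file, so the trigger is never met no matter how much data has expired.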

I hope that this helps.


Thanks,
Dan

Daniel Reverri
Developer Advocate
Basho Technologies, Inc.
[email protected]


On Mon, Jun 13, 2011 at 3:57 PM, Steve Webb <[email protected]> wrote:

> Q: It looks like I have files in my bitcask directories that are not being
> actively used (I've restarted, and they still seem to be a pretty decent
> size, and the mtime is several days old):
>
>
> root@ha2:/data/riaksearch/bitcask/1027618338748291114361965898003636498195577569280#
> ls -la
> total 791992
> drwxr-xr-x  2 riak riak      4096 2011-06-13 16:49 .
> drwxr-xr-x 34 riak riak      4096 2011-06-13 16:47 ..
> -rw-------  1 riak riak 239369130 2011-06-08 13:11 1307415077.bitcask.data
> -rw-r--r--  1 riak riak   4458434 2011-06-08 13:11 1307415077.bitcask.hint
> -rw-------  1 riak riak 288686080 2011-06-10 13:30 1307562153.bitcask.data
> -rw-r--r--  1 riak riak   5347188 2011-06-10 13:30 1307562153.bitcask.hint
> -rw-------  1 riak riak   1431867 2011-06-08 13:45 1307562333.bitcask.data
> -rw-r--r--  1 riak riak     27162 2011-06-08 13:45 1307562333.bitcask.hint
> -rw-------  1 riak riak 259423130 2011-06-13 15:55 1307862506.bitcask.data
> -rw-r--r--  1 riak riak   9814878 2011-06-13 15:55 1307862506.bitcask.hint
> -rw-------  1 riak riak    950009 2011-06-13 16:22 1308003767.bitcask.data
> -rw-r--r--  1 riak riak     35768 2011-06-13 16:22 1308003767.bitcask.hint
> -rw-------  1 riak riak    561579 2011-06-13 16:55 1308005359.bitcask.data
> -rw-r--r--  1 riak riak     21024 2011-06-13 16:55 1308005359.bitcask.hint
> -rw-------  1 riak riak       107 2011-06-13 16:49 bitcask.write.lock
>
> root@ha2:/data/riaksearch/bitcask/1027618338748291114361965898003636498195577569280#
>
> How come these older files aren't being considered for merge?
>
> Again, my bitcask settings are:
>
>
>  %% Bitcask Config
>  {bitcask, [
>             {data_root, "/var/lib/riaksearch/bitcask" },
>             {dead_bytes_merge_trigger, 10242880 },
>             {dead_bytes_threshold, 5242880 },
>             {max_file_size, 80000000 },
>             {expiry_secs, 86400}
>           ]},
>
> - Steve
>
> --
> Steve Webb - Senior System Administrator for gnip.com
> http://twitter.com/GnipWebb
>
> On Mon, 13 Jun 2011, Dan Reverri wrote:
>
>> Hi Steve,
>>
>> The article points out that the active data file is not considered during
>> merge checks. Your 250-ish MB data file is the active file and not
>> considered during the merge check. The file will eventually roll over to a
>> non-active file when it hits 2 GB in size. Once the file is not active it
>> will be considered during the merge check and merging will take place.
>>
>> The 2 GB file size is configurable via the max_file_size parameter:
>> https://github.com/basho/bitcask/blob/master/ebin/bitcask.app#L22
>>
>> Thanks,
>> Dan
>>
>> Daniel Reverri
>> Developer Advocate
>> Basho Technologies, Inc.
>> [email protected]
>>
>>
>> On Mon, Jun 13, 2011 at 2:38 PM, Steve Webb <[email protected]> wrote:
>>
>>> Dan -
>>>
>>> I've got dead_bytes_threshold=5242880 (5M) and
>>> dead_bytes_merge_trigger=10242880.  My bitcask *.data files are 250-ish MB
>>> in size:
>>>
>>> root@ha2:/data/riaksearch/bitcask/1027618338748291114361965898003636498195577569280#
>>> ls -lah
>>> total 771M
>>> drwxr-xr-x  2 riak riak 4.0K 2011-06-12 01:08 .
>>> drwxr-xr-x 34 riak riak 4.0K 2011-06-12 01:10 ..
>>> -rw-------  1 riak riak 229M 2011-06-08 13:11 1307415077.bitcask.data
>>> -rw-r--r--  1 riak riak 4.3M 2011-06-08 13:11 1307415077.bitcask.hint
>>> -rw-------  1 riak riak 276M 2011-06-10 13:30 1307562153.bitcask.data
>>> -rw-r--r--  1 riak riak 5.1M 2011-06-10 13:30 1307562153.bitcask.hint
>>> -rw-------  1 riak riak 1.4M 2011-06-08 13:45 1307562333.bitcask.data
>>> -rw-r--r--  1 riak riak  27K 2011-06-08 13:45 1307562333.bitcask.hint
>>> -rw-------  1 riak riak 246M 2011-06-13 15:34 1307862506.bitcask.data
>>> -rw-r--r--  1 riak riak 9.4M 2011-06-13 15:34 1307862506.bitcask.hint
>>> -rw-------  1 riak riak  107 2011-06-12 01:08 bitcask.write.lock
>>>
>>> I'm pretty sure that 50% or more of the data in these files should've
>>> aged-off by now and the merge trigger should've happened.  The article shows
>>> why merges happen when a restart is done, but it doesn't really explain why
>>> merges don't happen at normal runtime.
>>>
>>> I really don't want to restart riak every day to merge files.
>>>
>>> Q: What are some good trigger settings for my use case?
>>>
>>> I want to collect and store 1 day's worth of tweets from the twitter spritzer
>>> feed and have the data files auto-merge once in a while (once a day or more
>>> frequently) when they've gotten 10% of 'dead' data in them (aka, the tweets
>>> expire after 1 day).
>>>
>>>
>>> - Steve
>>>
>>> --
>>> Steve Webb - Senior System Administrator for gnip.com
>>> http://twitter.com/GnipWebb
>>>
>>> On Mon, 13 Jun 2011, Dan Reverri wrote:
>>>
>>>> Hi Steve,
>>>>
>>>> This Knowledge Base article may be related:
>>>>
>>>>
>>>> https://help.basho.com/entries/20141178-why-does-it-seem-that-bitcask-merging-is-only-triggered-when-a-riak-node-is-restarted
>>>>
>>>> Thanks,
>>>> Dan
>>>>
>>>> Daniel Reverri
>>>> Developer Advocate
>>>> Basho Technologies, Inc.
>>>> [email protected]
>>>>
>>>>
>>>> On Mon, Jun 13, 2011 at 10:25 AM, Steve Webb <[email protected]> wrote:
>>>>
>>>>> Justin -
>>>>>
>>>>> My current bitcask settings are:
>>>>>
>>>>>  %% Bitcask Config
>>>>>  {bitcask, [
>>>>>           {data_root, "/var/lib/riaksearch/bitcask" },
>>>>>           {dead_bytes_merge_trigger, 10242880 },
>>>>>           {dead_bytes_threshold, 5242880 },
>>>>>           {expiry_secs, 86400}
>>>>>         ]},
>>>>>
>>>>> My understanding of these settings is that the data should auto-expire
>>>>> after one day.  Also, once a bitcask file in
>>>>> .../riaksearch/bitcask/xxx/*.data has 10M of "dead" or expired data
>>>>> in it, it should be merged, right?
>>>>>
>>>>> I'm collecting the spritzer twitter stream and loading it into two buckets
>>>>> (one non-indexed bucket holds the full tweet, one indexed bucket holds the
>>>>> tweet string, id, date and username).  I used to see about 10 GB of data
>>>>> total, but it's growing and currently at 26GB of data total.
>>>>>
>>>>> I'm seeing these in the logs:
>>>>>
>>>>> =INFO REPORT==== 13-Jun-2011::08:28:19 ===
>>>>> Pid <0.6844.0> compacted 3 segments for 942232 bytes in 4.900694 seconds, 0.18 MB/sec
>>>>>
>>>>> =INFO REPORT==== 13-Jun-2011::08:29:01 ===
>>>>> Pid <0.6267.0> compacted 3 segments for 1721790 bytes in 9.690511 seconds, 0.17 MB/sec
>>>>>
>>>>> =INFO REPORT==== 13-Jun-2011::08:31:23 ===
>>>>> Pid <0.6924.0> compacted 3 segments for 6988416 bytes in 44.659753 seconds, 0.15 MB/sec
>>>>>
>>>>> ... but I'm not seeing any "merging" related entries.
>>>>>
>>>>>
>>>>> - Steve
>>>>>
>>>>> --
>>>>> Steve Webb - Senior System Administrator for gnip.com
>>>>> http://twitter.com/GnipWebb
>>>>>
>>>>> On Wed, 8 Jun 2011, Justin Sheehy wrote:
>>>>>
>>>>>> Hi, Steve.
>>>>>>
>>>>>> Check out this page:
>>>>>>
>>>>>>
>>>>>> http://wiki.basho.com/Bitcask-Configuration.html#Disk-Usage-and-Merging-Settings
>>>>>>
>>>>>> Basically, a "merge trigger" must be met in order to have the merge
>>>>>> process occur.  When it does occur, it will affect all existing files that
>>>>>> meet a "merge threshold."
>>>>>>
>>>>>> One note that is relevant for your specific use: the expiry_secs parameter
>>>>>> will cause a given item to disappear from the client API immediately after
>>>>>> expiry, and to be cleaned if it is in a file already being merged, but will
>>>>>> not currently contribute toward merge triggers or thresholds on its own if
>>>>>> not otherwise "dead".
>>>>>>
>>>>>> -Justin
>>>>>>
>>>>>>
>>>>>> On Jun 7, 2011, at 4:29 PM, Steve Webb wrote:
>>>>>>
>>>>>>> Hello there.
>>>>>>>
>>>>>>> I'm curious - I'm up to about 10GB of storage and I'm guessing that I'll
>>>>>>> be full in 3-4 more days of ingesting data.  I have no idea if/when a merge
>>>>>>> will run to expire the older data.
>>>>>>>
>>>>>>> I'm loading a 2-node (1GB mem, 20GB storage, vmware VMs) riaksearch
>>>>>>> cluster with the spritzer twitter feed.  I used the bitcask 'expiry_secs' to
>>>>>>> expire data after 3 days. Q: Is there a method or command to force a merge
>>>>>>> at any time? Q: Is there a way to run a merge when the storage size reaches
>>>>>>> a specific threshold?
>>>>>>>
>>>>>>>
>>>>>>> - Steve
>>>>>>>
>>>>>>> --
>>>>>>> Steve Webb - Senior System Administrator for gnip.com
>>>>>>> http://twitter.com/GnipWebb
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> riak-users mailing list
>>>>>>> [email protected]
>>>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>>>>>
>>>>>>>
>>>>>>>