Hi Steve,

Those files are not being merged for the reason Justin gave earlier: expiry is not a condition that affects the merge check.
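As a rough illustration of that merge check, here is a minimal Python sketch. It is a simplified model, not bitcask's actual code: the `DataFile` class, the `files_to_merge` function, and the file names and byte counts in the examples are all hypothetical, and it only models the two dead-bytes settings from the config in this thread. It assumes that a merge starts once any non-active file exceeds the trigger, and then covers every non-active file above the threshold.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    name: str
    dead_bytes: int          # bytes that were overwritten or explicitly deleted
    expired_bytes: int = 0   # bytes past expiry_secs that were never deleted
    active: bool = False     # the file currently being written to


def files_to_merge(files, dead_bytes_merge_trigger, dead_bytes_threshold):
    """Return names of files a merge would pick up, or [] if nothing triggers."""
    # The active file is skipped by the merge check.
    candidates = [f for f in files if not f.active]
    # expired_bytes is deliberately ignored: expiry alone is not counted as garbage.
    if not any(f.dead_bytes > dead_bytes_merge_trigger for f in candidates):
        return []
    return [f.name for f in candidates if f.dead_bytes > dead_bytes_threshold]


TRIGGER, THRESHOLD = 10242880, 5242880  # values from the config in this thread

# Hypothetical append-only workload: plenty of expired data, nothing deleted.
append_only = [
    DataFile("old-1.data", dead_bytes=0, expired_bytes=200_000_000),
    DataFile("old-2.data", dead_bytes=0, expired_bytes=150_000_000),
    DataFile("current.data", dead_bytes=0, active=True),
]

# Hypothetical workload with deletes or overwrites.
with_deletes = [
    DataFile("old-1.data", dead_bytes=12_000_000),
    DataFile("old-2.data", dead_bytes=6_000_000),
    DataFile("old-3.data", dead_bytes=1_000_000),
    DataFile("current.data", dead_bytes=50_000_000, active=True),
]

print(files_to_merge(append_only, TRIGGER, THRESHOLD))   # []
print(files_to_merge(with_deletes, TRIGGER, THRESHOLD))  # ['old-1.data', 'old-2.data']
```

In the append-only case nothing ever counts as dead, so the trigger is never met no matter how much data has expired.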
Stated earlier:

That is, bitcask wasn't originally designed around the expiry-centric way of removing old data, and data that has simply expired (but not actively been deleted) will not be counted as garbage toward thresholds or triggers at this time. It will be cleaned up in a merge, but will not contribute toward causing the merge in the first place. In a use case where you only add items and never actually delete anything, a merge will never be dynamically triggered.

It is plausible that we could add some expiry-statistics measurement and triggering to bitcask, but today that's the state of things. You could manually trigger merges, but that currently requires a bit of Erlang.

I hope that this helps.

Thanks,
Dan

Daniel Reverri
Developer Advocate
Basho Technologies, Inc.
[email protected]

On Mon, Jun 13, 2011 at 3:57 PM, Steve Webb <[email protected]> wrote:

> Q: It looks like I have files in my bitcask directories that are not being
> actively used (I've restarted, and they still seem to be a pretty decent
> size, and the mtime is several days old):
>
> root@ha2:/data/riaksearch/bitcask/1027618338748291114361965898003636498195577569280# ls -la
> total 791992
> drwxr-xr-x  2 riak riak      4096 2011-06-13 16:49 .
> drwxr-xr-x 34 riak riak      4096 2011-06-13 16:47 ..
> -rw-------  1 riak riak 239369130 2011-06-08 13:11 1307415077.bitcask.data
> -rw-r--r--  1 riak riak   4458434 2011-06-08 13:11 1307415077.bitcask.hint
> -rw-------  1 riak riak 288686080 2011-06-10 13:30 1307562153.bitcask.data
> -rw-r--r--  1 riak riak   5347188 2011-06-10 13:30 1307562153.bitcask.hint
> -rw-------  1 riak riak   1431867 2011-06-08 13:45 1307562333.bitcask.data
> -rw-r--r--  1 riak riak     27162 2011-06-08 13:45 1307562333.bitcask.hint
> -rw-------  1 riak riak 259423130 2011-06-13 15:55 1307862506.bitcask.data
> -rw-r--r--  1 riak riak   9814878 2011-06-13 15:55 1307862506.bitcask.hint
> -rw-------  1 riak riak    950009 2011-06-13 16:22 1308003767.bitcask.data
> -rw-r--r--  1 riak riak     35768 2011-06-13 16:22 1308003767.bitcask.hint
> -rw-------  1 riak riak    561579 2011-06-13 16:55 1308005359.bitcask.data
> -rw-r--r--  1 riak riak     21024 2011-06-13 16:55 1308005359.bitcask.hint
> -rw-------  1 riak riak       107 2011-06-13 16:49 bitcask.write.lock
> root@ha2:/data/riaksearch/bitcask/1027618338748291114361965898003636498195577569280#
>
> How come these older files aren't being considered for merge?
>
> Again, my bitcask settings are:
>
> %% Bitcask Config
> {bitcask, [
>     {data_root, "/var/lib/riaksearch/bitcask" },
>     {dead_bytes_merge_trigger, 10242880 },
>     {dead_bytes_threshold, 5242880 },
>     {max_file_size, 80000000 },
>     {expiry_secs, 86400}
> ]},
>
> - Steve
>
> --
> Steve Webb - Senior System Administrator for gnip.com
> http://twitter.com/GnipWebb
>
> On Mon, 13 Jun 2011, Dan Reverri wrote:
>
>> Hi Steve,
>>
>> The article points out that the active data file is not considered during
>> merge checks. Your 250-ish MB data file is the active file and not
>> considered during the merge check. The file will eventually roll over to a
>> non-active file when it hits 2 GB in size. Once the file is not active it
>> will be considered during the merge check and merging will take place.
>>
>> The 2 GB file size is configurable via the max_file_size parameter:
>> https://github.com/basho/bitcask/blob/master/ebin/bitcask.app#L22
>>
>> Thanks,
>> Dan
>>
>> Daniel Reverri
>> Developer Advocate
>> Basho Technologies, Inc.
>> [email protected]
>>
>> On Mon, Jun 13, 2011 at 2:38 PM, Steve Webb <[email protected]> wrote:
>>
>>> Dan -
>>>
>>> I've got dead_bytes_threshold=5242880 (5M) and
>>> dead_bytes_merge_trigger=10242880. My bitcask *.data files are 250-ish MB
>>> in size:
>>>
>>> root@ha2:/data/riaksearch/bitcask/1027618338748291114361965898003636498195577569280# ls -lah
>>> total 771M
>>> drwxr-xr-x  2 riak riak 4.0K 2011-06-12 01:08 .
>>> drwxr-xr-x 34 riak riak 4.0K 2011-06-12 01:10 ..
>>> -rw-------  1 riak riak 229M 2011-06-08 13:11 1307415077.bitcask.data
>>> -rw-r--r--  1 riak riak 4.3M 2011-06-08 13:11 1307415077.bitcask.hint
>>> -rw-------  1 riak riak 276M 2011-06-10 13:30 1307562153.bitcask.data
>>> -rw-r--r--  1 riak riak 5.1M 2011-06-10 13:30 1307562153.bitcask.hint
>>> -rw-------  1 riak riak 1.4M 2011-06-08 13:45 1307562333.bitcask.data
>>> -rw-r--r--  1 riak riak  27K 2011-06-08 13:45 1307562333.bitcask.hint
>>> -rw-------  1 riak riak 246M 2011-06-13 15:34 1307862506.bitcask.data
>>> -rw-r--r--  1 riak riak 9.4M 2011-06-13 15:34 1307862506.bitcask.hint
>>> -rw-------  1 riak riak  107 2011-06-12 01:08 bitcask.write.lock
>>>
>>> I'm pretty sure that 50% or more of the data in these files should've
>>> aged-off by now and the merge trigger should've happened. The article
>>> shows why merges happen when a restart is done, but it doesn't really
>>> explain why merges don't happen at normal runtime.
>>>
>>> I really don't want to restart riak every day to merge files.
>>>
>>> Q: What are some good trigger settings for my use case?
>>>
>>> I want to collect and store 1 day's worth of tweets from the twitter
>>> spritzer feed and have the data files auto-merge once in a while (once a
>>> day or more frequently) when they've gotten 10% of 'dead' data in them
>>> (aka, the tweets expire after 1 day).
>>>
>>> - Steve
>>>
>>> --
>>> Steve Webb - Senior System Administrator for gnip.com
>>> http://twitter.com/GnipWebb
>>>
>>> On Mon, 13 Jun 2011, Dan Reverri wrote:
>>>
>>>> Hi Steve,
>>>>
>>>> This Knowledge Base article may be related:
>>>>
>>>> https://help.basho.com/entries/20141178-why-does-it-seem-that-bitcask-merging-is-only-triggered-when-a-riak-node-is-restarted
>>>>
>>>> Thanks,
>>>> Dan
>>>>
>>>> Daniel Reverri
>>>> Developer Advocate
>>>> Basho Technologies, Inc.
>>>> [email protected]
>>>>
>>>> On Mon, Jun 13, 2011 at 10:25 AM, Steve Webb <[email protected]> wrote:
>>>>
>>>>> Justin -
>>>>>
>>>>> My current bitcask settings are:
>>>>>
>>>>> %% Bitcask Config
>>>>> {bitcask, [
>>>>>     {data_root, "/var/lib/riaksearch/bitcask" },
>>>>>     {dead_bytes_merge_trigger, 10242880 },
>>>>>     {dead_bytes_threshold, 5242880 },
>>>>>     {expiry_secs, 86400}
>>>>> ]},
>>>>>
>>>>> My understanding is that these settings mean the data should
>>>>> auto-expire after one day. Also, each bitcask file in
>>>>> .../riaksearch/bitcask/xxx/*.data should be merged once it has 10M of
>>>>> "dead" or expired data in it, right?
>>>>>
>>>>> I'm collecting the spritzer twitter stream and loading it into two
>>>>> buckets (one non-indexed bucket holds the full tweet, one indexed
>>>>> bucket holds the tweet string, id, date and username). I used to see
>>>>> about 10 GB of data total, but it's growing and currently at 26GB of
>>>>> data total.
>>>>>
>>>>> I'm seeing these in the logs:
>>>>>
>>>>> =INFO REPORT==== 13-Jun-2011::08:28:19 ===
>>>>> Pid <0.6844.0> compacted 3 segments for 942232 bytes in 4.900694 seconds, 0.18 MB/sec
>>>>>
>>>>> =INFO REPORT==== 13-Jun-2011::08:29:01 ===
>>>>> Pid <0.6267.0> compacted 3 segments for 1721790 bytes in 9.690511 seconds, 0.17 MB/sec
>>>>>
>>>>> =INFO REPORT==== 13-Jun-2011::08:31:23 ===
>>>>> Pid <0.6924.0> compacted 3 segments for 6988416 bytes in 44.659753 seconds, 0.15 MB/sec
>>>>>
>>>>> ... but I'm not seeing any "merging" related entries.
>>>>>
>>>>> - Steve
>>>>>
>>>>> --
>>>>> Steve Webb - Senior System Administrator for gnip.com
>>>>> http://twitter.com/GnipWebb
>>>>>
>>>>> On Wed, 8 Jun 2011, Justin Sheehy wrote:
>>>>>
>>>>>> Hi, Steve.
>>>>>>
>>>>>> Check out this page:
>>>>>>
>>>>>> http://wiki.basho.com/Bitcask-Configuration.html#Disk-Usage-and-Merging-Settings
>>>>>>
>>>>>> Basically, a "merge trigger" must be met in order to have the merge
>>>>>> process occur. When it does occur, it will affect all existing files
>>>>>> that meet a "merge threshold."
>>>>>>
>>>>>> One note that is relevant for your specific use: the expiry_secs
>>>>>> parameter will cause a given item to disappear from the client API
>>>>>> immediately after expiry, and to be cleaned if it is in a file already
>>>>>> being merged, but will not currently contribute toward merge triggers
>>>>>> or thresholds on its own if not otherwise "dead".
>>>>>>
>>>>>> -Justin
>>>>>>
>>>>>> On Jun 7, 2011, at 4:29 PM, Steve Webb wrote:
>>>>>>
>>>>>>> Hello there.
>>>>>>>
>>>>>>> I'm curious - I'm up to about 10GB of storage and I'm guessing that
>>>>>>> I'll be full in 3-4 more days of ingesting data. I have no idea
>>>>>>> if/when a merge will run to expire the older data.
>>>>>>>
>>>>>>> I'm loading a 2-node (1GB mem, 20GB storage, vmware VMs) riaksearch
>>>>>>> cluster with the spritzer twitter feed. I used the bitcask
>>>>>>> 'expiry_secs' to expire data after 3 days.
>>>>>>>
>>>>>>> Q: Is there a method or command to force a merge at any time?
>>>>>>> Q: Is there a way to run a merge when the storage size reaches a
>>>>>>> specific threshold?
>>>>>>>
>>>>>>> - Steve
>>>>>>>
>>>>>>> --
>>>>>>> Steve Webb - Senior System Administrator for gnip.com
>>>>>>> http://twitter.com/GnipWebb
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> riak-users mailing list
>>>>>>> [email protected]
>>>>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
