Well, it's even better than that ;) We have optional log flushing, which by default runs every 10 seconds. Set that to 100 milliseconds and that's as much data as you can lose. Also, if any other table on the same region server syncs, then this table's edits are synced too.
J-D

On Tue, Nov 17, 2009 at 4:36 PM, Jonathan Gray <jl...@streamy.com> wrote:
> Thoughts on a client-facing call to explicit call a WAL sync? So I could
> turn on DEFERRED_LOG_FLUSH (possibly leave it on always), run a batch of
> my inserts, and then run an explicit flush/sync. The returning of that
> call would guarantee to the client that the data up to that point is safe.
>
> JG
>
> On Mon, November 16, 2009 11:00 am, Jean-Daniel Cryans wrote:
>> I added a new feature for tables called "deferred flush", see
>> https://issues.apache.org/jira/browse/HBASE-1944
>>
>> My opinion is that the default should be paranoid enough to not lose
>> any user data. If we can change a table's attribute without taking it down
>> (there's a jira on that), wouldn't that solve the import problem?
>>
>> For example: have some table that needs to have fast insertion via MR.
>> During the creation of the job, you change the table's
>> DEFERRED_LOG_FLUSH to "true", then run the job and finally set the
>> value to false when the job is done.
>>
>> This way you still pass the responsibility to the user but for
>> performance reasons.
>>
>> J-D
>>
>> On Mon, Nov 16, 2009 at 2:05 AM, Cosmin Lehene <cleh...@adobe.com> wrote:
>>
>>> We could have a speedy default and an extra parameter for puts that
>>> would specify a flush is needed. This way you pass the responsibility to
>>> the user and he can decide if he needs to be paranoid or not. This could
>>> be part of Put and even specify granularity of the flush if needed.
>>>
>>> Cosmin
>>>
>>> On 11/15/09 6:59 PM, "Andrew Purtell" <apurt...@apache.org> wrote:
>>>
>>>> I agree with this.
>>>>
>>>> I also think we should leave the default as is with the caveat that
>>>> we call out the durability versus write performance tradeoff in the
>>>> flushlogentries description and up on the wiki somewhere, maybe on
>>>> http://wiki.apache.org/hadoop/PerformanceTuning . We could also
>>>> provide two example configurations, one for performance (reasonable
>>>> tradeoffs), one for paranoia. I put up an issue:
>>>> https://issues.apache.org/jira/browse/HBASE-1984
>>>>
>>>> - Andy
>>>>
>>>> ________________________________
>>>> From: Ryan Rawson <ryano...@gmail.com>
>>>> To: hbase-dev@hadoop.apache.org
>>>> Sent: Sat, November 14, 2009 11:22:13 PM
>>>> Subject: Re: Should we change the default value of
>>>> hbase.regionserver.flushlogentries for 0.21?
>>>>
>>>> That sync at the end of a RPC is my doing. You dont want to sync
>>>> every _EDIT_, after all, the previous definition of the word "edit"
>>>> was each KeyValue. So we could be calling sync for every single
>>>> column in a row. Bad stuff.
>>>>
>>>> In the end, if the regionserver crashes during a batch put, we will
>>>> never know how much of the batch was flushed to the WAL. Thus it makes
>>>> sense to only do it once and get a massive, massive, speedup.
>>>>
>>>> On Sat, Nov 14, 2009 at 9:45 PM, stack <st...@duboce.net> wrote:
>>>>
>>>>> I'm for leaving it as it is, at every 100 edits -- maybe every 10
>>>>> edits? Speed stays as it was. We used to lose MBs. By default,
>>>>> we'll now lose 99 or 9 edits max.
>>>>>
>>>>> We need to do some work bringing folks along regardless of what we
>>>>> decide. Flush happens at the end of the put up in the regionserver.
>>>>> If you are doing a batch of commits -- e.g. using a big write buffer
>>>>> over on your client -- the puts will only be flushed on the way out
>>>>> after the batch put completes EVEN if you have configured hbase to
>>>>> sync every edit (I ran into this this evening. J-D sorted me out).
>>>>> We need to make sure folks are up on this.
>>>>>
>>>>> St.Ack
>>>>>
>>>>> On Sat, Nov 14, 2009 at 4:37 PM, Jean-Daniel Cryans
>>>>> <jdcry...@apache.org> wrote:
>>>>>
>>>>>> Hi dev!
>>>>>>
>>>>>> Hadoop 0.21 now has a reliable append and flush feature and this
>>>>>> gives us the opportunity to review some assumptions. The current
>>>>>> situation:
>>>>>>
>>>>>> - Every edit going to a catalog table is flushed so there's no
>>>>>>   data loss.
>>>>>> - The user tables edits are flushed every
>>>>>>   hbase.regionserver.flushlogentries which by default is 100.
>>>>>>
>>>>>> Should we now set this value to 1 in order to have more durable
>>>>>> but slower inserts by default? Please speak up.
>>>>>>
>>>>>> Thx,
>>>>>>
>>>>>> J-D
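
For anyone skimming the thread, the "99 or 9 edits max" figure and the explicit-sync idea can be checked with a toy model. This is only a sketch and not HBase code: `ToyWal`, `worst_case_loss` and their parameters are made-up names for illustration, `flush_every` stands in for hbase.regionserver.flushlogentries, and the model syncs per edit, ignoring both the end-of-RPC sync behaviour and the timed optional flush discussed above.

```python
# Toy model only; none of these names exist in HBase.
class ToyWal:
    """A write-ahead log that syncs after every `flush_every` appended edits."""

    def __init__(self, flush_every):
        if flush_every < 1:
            raise ValueError("flush_every must be at least 1")
        self.flush_every = flush_every
        self.synced = 0    # edits that would survive a crash
        self.pending = 0   # edits appended but not yet synced

    def append(self, n=1):
        """Append n edits one at a time, syncing whenever the threshold is hit."""
        for _ in range(n):
            self.pending += 1
            if self.pending >= self.flush_every:
                self.sync()

    def sync(self):
        """Explicit sync: everything appended so far becomes durable."""
        self.synced += self.pending
        self.pending = 0

    def lost_on_crash(self):
        """Edits that would be lost if the server crashed right now."""
        return self.pending


def worst_case_loss(flush_every, max_edits=1000):
    """Largest number of unsynced edits seen at any point while appending."""
    wal = ToyWal(flush_every)
    worst = 0
    for _ in range(max_edits):
        wal.append()
        worst = max(worst, wal.lost_on_crash())
    return worst


if __name__ == "__main__":
    for n in (100, 10, 1):
        print(n, worst_case_loss(n))

    # Deferred flushing plus one explicit sync at the end of a batch.
    wal = ToyWal(100)
    wal.append(250)
    print("unsynced before explicit sync:", wal.lost_on_crash())
    wal.sync()
    print("unsynced after explicit sync:", wal.lost_on_crash())
```

In this model the exposure window is always one less than the flush threshold, and a single explicit sync after the batch closes it, which is the guarantee the client-facing sync call proposed above would give.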