[ https://issues.apache.org/jira/browse/HBASE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848987#action_12848987 ]

ryan rawson commented on HBASE-2353:
------------------------------------

I have new numbers: my bulk puts are now much slower than before. This is a 
killer for us.  Single-thread import performance is down to 2,000-6,000 
rows/sec, from 16,000+.

The first fix for this is to bring back deferred log flush.  I have a 
forthcoming patch.
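Deferred log flush, roughly sketched below (stub classes and names of my own, not the actual HBase API): append() only buffers the edit and returns to the caller, and a flush() that a background thread would run periodically makes the whole buffered batch durable with a single sync.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of deferred log flush (hypothetical classes, not HBase code).
 *  Edits are acknowledged as soon as they are buffered; a periodic flusher
 *  pays one sync for however many edits have accumulated. */
class DeferredWal {
    private final List<String> pending = new ArrayList<>();
    int syncCount = 0;  // how many times we actually hit "disk"

    // Returns as soon as the edit is buffered -- durability is deferred.
    synchronized void append(String edit) {
        pending.add(edit);
    }

    // In a real server a background thread would call this every N ms.
    synchronized void flush() {
        if (!pending.isEmpty()) {
            syncCount++;     // one sync covers the entire buffered batch
            pending.clear();
        }
    }

    synchronized int pendingSize() { return pending.size(); }
}
```

The trade-off is the usual one: a crash between append() and the next flush() loses the buffered edits, which is why it has to be opt-in per table or per call.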

Here are my arguments:

- There is no multi-row atomicity guarantee. Having other clients see the 
partial results of your batch put is acceptable because our consistency model 
is per-row. That is the de facto situation right now anyway.
- If the call succeeds, we expect the puts to be durable.  Ensuring the 
syncFs() call returns before we return to the client gives us this.
- Partial failure by exception leaves the HLog in an uncertain state.  The 
client will not know how many rows were successfully made durable, and thus 
would have to redo the put.
- Partial "failure" by return code means only some of the rows were made 
durable and visible to other clients.  This is normal and, I think, covered by 
the cases above.

Given this, what makes the most sense?  It seems like hlog.append() for all the 
puts, then syncFs(), THEN the memstore mutate is the way to go. In HRS.put our 
protection against going over memory is this call:

        this.cacheFlusher.reclaimMemStoreMemory();

which flushes synchronously until we aren't going to go over memory.  If we 
somehow failed to add to the memstore, it would be via an OOME, which would 
kill the RS anyway.  Considering the data for the Put is already in memory and 
we are just adjusting data-structure nodes, it seems unlikely we'd hit this 
case often, if ever.
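The ordering above can be sketched like so (stub classes of my own, not the real HRegionServer code): append every edit to the log, pay one syncFs() for the whole batch, and only then mutate the memstore, so nothing becomes visible to readers before it is durable.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the proposed batch-put ordering (stand-in classes, not HBase):
 *  append all edits, sync once for the whole batch, then mutate the memstore. */
class BatchPutRegion {
    final List<String> hlog = new ArrayList<>();      // stand-in for the WAL
    final List<String> memstore = new ArrayList<>();  // stand-in for the memstore
    int syncs = 0;  // counts syncFs() calls

    void put(List<String> rows) {
        // reclaimMemStoreMemory() would run here, flushing synchronously
        // until we are back under the memory limit (omitted in this sketch).
        for (String row : rows) {
            hlog.add(row);        // hlog.append() per edit, no sync yet
        }
        syncs++;                  // one syncFs() makes the whole batch durable
        for (String row : rows) {
            memstore.add(row);    // memstore mutate only after durability
        }
    }
}
```

With this ordering, the per-row sync cost that HBASE-2283 introduced collapses back to one sync per put(Put[]) call, while readers can never see an edit that has not been made durable.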

> HBASE-2283 removed bulk sync optimization for multi-row puts
> ------------------------------------------------------------
>
>                 Key: HBASE-2353
>                 URL: https://issues.apache.org/jira/browse/HBASE-2353
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: ryan rawson
>             Fix For: 0.21.0
>
>         Attachments: HBASE-2353-deferred.txt
>
>
> Prior to HBASE-2283 we called flush/sync once per put(Put[]) call (i.e., 
> once per batch of commits).  Now we do it for every row.
> This makes bulk uploads slower if you are using the WAL.  Is there an 
> acceptable solution that achieves both safety and performance by 
> bulk-sync'ing puts?  Or would this not work in the face of atomicity 
> guarantees?
> discuss!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
