[ 
https://issues.apache.org/jira/browse/HBASE-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-2898:
-------------------------

    Fix Version/s: 0.90.0

Bringing into 0.90.  Agree it's a blocker.

> MultiPut makes proper error handling impossible and leads to corrupted data
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-2898
>                 URL: https://issues.apache.org/jira/browse/HBASE-2898
>             Project: HBase
>          Issue Type: Bug
>          Components: client, regionserver
>    Affects Versions: 0.89.20100621
>            Reporter: Benoit Sigoure
>            Priority: Blocker
>             Fix For: 0.90.0
>
>
> The {{MultiPut}} RPC needs to be completely rewritten.  Let's see why step by 
> step.
> # An HBase user calls any of the {{put}} methods on an {{HTable}} instance.
> # Eventually, {{HTable#flushCommits}} is invoked to actually send the edits 
> to the RegionServer(s).
> # This takes us to {{HConnectionManager#processBatchOfPuts}}, where all 
> edits are sorted into one or more {{MultiPut}}.  Each {{MultiPut}} 
> aggregates all the edits destined for a particular RegionServer.
> # A thread pool is used to send all the {{MultiPut}} in parallel to their 
> respective RegionServer.  Let's follow what happens for a single {{MultiPut}}.
> # The {{MultiPut}} travels through the IPC code on the client and then 
> through the network and then through the IPC code on the RegionServer.
> # We're now in {{HRegionServer#multiPut}} where a new {{MultiPutResponse}} is 
> created.
> # Still in {{HRegionServer#multiPut}}.  Since a {{MultiPut}} is essentially a 
> map from region name to a list of {{Put}} for that region, there's a {{for}} 
> loop that executes each list of {{Put}} for each region sequentially.  Let's 
> follow what happens for a single list of {{Put}} for a particular region.
> # We're now in {{HRegionServer#put(byte[], List<Put>)}}.  Each {{Put}} is 
> associated with the row lock that was specified by the client (if any).  Then 
> the pairs of {{(Put, lock id)}} are handed to the right {{HRegion}}.
> # Now we're in {{HRegion#put(Pair<Put, Integer>[])}}, which immediately takes 
> us to {{HRegion#doMiniBatchPut}}.
> # At this point, let's assume that we're doing just 2 edits, so the 
> {{BatchOperationInProgress}} that {{doMiniBatchPut}} operates on contains 
> just 2 {{Put}}.
> # The {{while}} loop in {{doMiniBatchPut}} that's going to execute each 
> {{Put}} starts.
> # The first {{Put}} succeeds normally, so at the end of the first iteration, 
> its {{batchOp.retCodes}} is marked as {{OperationStatusCode.SUCCESS}}.
> # The second {{Put}} fails because an exception is thrown when appending the 
> edit to the {{WAL}}.  Its {{batchOp.retCodes}} is marked as 
> {{OperationStatusCode.FAILURE}}.
> # {{doMiniBatchPut}} is done and returns to {{HRegion#put(Pair<Put, 
> Integer>[])}}, which returns to {{HRegionServer#put(byte[], List<Put>)}}.
> # At this point, {{HRegionServer#put(byte[], List<Put>)}} does a {{for}} loop 
> and extracts the *first* failure out of the {{OperationStatusCode[]}}.  In 
> our case, it'll take the {{OperationStatusCode.FAILURE}} that came from the 
> second {{Put}}.
> # This error code is returned to {{HRegionServer#multiPut}}, which records in 
> the {{MultiPutResponse}} that there was a failure - but *we don't know for 
> which {{Put}}* (see the sketch right after this list).
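> To make that information loss concrete, here's a minimal sketch of the 
> control flow in the last two steps.  This is *not* the actual 0.89 source, 
> just an illustration: the enum mirrors {{OperationStatusCode}} and 
> {{collapse}} stands in for the {{for}} loop in {{HRegionServer#put(byte[], 
> List<Put>)}}:
> {code:java}
> // Illustrative sketch, not the real HRegionServer code: one status per
> // Put goes in, a single status with no index comes out.
> public class StatusCollapseSketch {
>   enum OperationStatusCode { SUCCESS, FAILURE }
>
>   // Stand-in for the for loop in HRegionServer#put(byte[], List<Put>):
>   // it returns the *first* failure found in the per-Put status array.
>   static OperationStatusCode collapse(OperationStatusCode[] codes) {
>     for (OperationStatusCode code : codes) {
>       if (code != OperationStatusCode.SUCCESS) {
>         return code;  // the index of the failing Put is discarded here
>       }
>     }
>     return OperationStatusCode.SUCCESS;
>   }
>
>   public static void main(String[] args) {
>     // The scenario above: the first Put succeeded, the second failed
>     // during the WAL append.
>     OperationStatusCode[] codes = {
>         OperationStatusCode.SUCCESS, OperationStatusCode.FAILURE };
>     // All the client will ever see is FAILURE, with no index attached.
>     System.out.println(collapse(codes));
>   }
> }
> {code}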
> So the client has no way of knowing which {{Put}} failed.  All it knows is 
> that at least one of them failed for a particular region.  Its best bet is to 
> retry all the {{Put}} for this region.  But this has an unintended 
> consequence.  The {{Put}} that were successful during the first run will be 
> *re-applied*.  This will unexpectedly create extra versions.  Now I realize 
> most people don't really care about versions, so they won't notice.  But 
> whoever relies on the versions for whatever reason will rightfully consider 
> this to be data corruption.
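> Here's the hazard from the client's point of view, as a sketch assuming the 
> 0.89-era client API; the table and column names are made up, and we assume 
> the column family keeps more than one version (the default):
> {code:java}
> import java.util.List;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.KeyValue;
> import org.apache.hadoop.hbase.client.Get;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class RetryHazardSketch {
>   public static void main(String[] args) throws Exception {
>     HTable table = new HTable(HBaseConfiguration.create(), "t");
>     byte[] row = Bytes.toBytes("r");
>     byte[] fam = Bytes.toBytes("f");
>     byte[] qual = Bytes.toBytes("q");
>
>     Put p = new Put(row);                  // no explicit timestamp: the
>     p.add(fam, qual, Bytes.toBytes("v"));  // server stamps it at apply time
>     table.put(p);  // first attempt: this Put succeeds...
>     table.put(p);  // ...another Put in the same batch failed, so the
>                    // client blindly retries the whole batch
>
>     Get g = new Get(row);
>     g.setMaxVersions();  // fetch every stored version, not just the latest
>     Result r = table.get(g);
>     List<KeyValue> versions = r.getColumn(fam, qual);
>     // Prints 2 (assuming the two applies got distinct server timestamps):
>     // one logical write, two stored versions.
>     System.out.println("versions: " + versions.size());
>   }
> }
> {code}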
> {{MultiPut}} makes proper error handling impossible.  Since this RPC cannot 
> guarantee any atomicity other than at the individual {{Put}} level, it *must* 
> return to the client specific information about which {{Put}} failed in case 
> of a failure, so that the client can do proper error handling.
> This requires us to change the {{MultiPutResponse}} so that it can indicate 
> which {{Put}} specifically failed.  We could do this for instance by giving 
> the index of the {{Put}} along with its {{OperationStatusCode}}.  So in the 
> scenario above, the {{MultiPutResponse}} would essentially return something 
> like: "for that particular region, put #0 succeeded, put #1 failed".  If we 
> want to save a bit of space, we may want to omit the successes from the 
> response and only mention the failures - so a response that doesn't mention 
> any failure means that everything was successful.  Not sure whether that's a 
> good idea though.
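> For illustration, here's one possible shape for such a response.  The class 
> and field names are hypothetical, not a committed design, and the Writable 
> plumbing is omitted:
> {code:java}
> import java.util.ArrayList;
> import java.util.Collections;
> import java.util.List;
> import java.util.Map;
> import java.util.TreeMap;
> import org.apache.hadoop.hbase.util.Bytes;
>
> // Hypothetical sketch of a per-Put response, not a committed design.
> public class MultiPutResponseSketch {
>   // For each region (keyed by region name): the indexes of the Puts
>   // that failed, in the order the client sent them.  If we omit the
>   // successes, a missing entry means everything for that region worked.
>   private final Map<byte[], List<Integer>> failedByRegion =
>       new TreeMap<byte[], List<Integer>>(Bytes.BYTES_COMPARATOR);
>
>   public void addFailure(byte[] regionName, int putIndex) {
>     List<Integer> failed = failedByRegion.get(regionName);
>     if (failed == null) {
>       failed = new ArrayList<Integer>();
>       failedByRegion.put(regionName, failed);
>     }
>     failed.add(putIndex);
>   }
>
>   public List<Integer> getFailures(byte[] regionName) {
>     List<Integer> failed = failedByRegion.get(regionName);
>     return failed == null ? Collections.<Integer>emptyList() : failed;
>   }
> }
> {code}
> In the scenario above, the response for that region would carry {{[1]}}: put 
> #0 succeeded, put #1 failed, and the client retries only put #1.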
> Since doing this requires an incompatible RPC change, I propose that we take 
> the opportunity to rewrite the {{MultiPut}} RPC too.  Right now it's very 
> inefficient; it's just a hack on top of {{Put}}.  When {{MultiPut}} is 
> written to the wire, a lot of unnecessary duplicate data is sent out.  The 
> timestamp, the row key and the family are sent out to the wire N+1 times, 
> where N is the number of edits for a particular row, instead of just once (!).
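> A back-of-the-envelope sketch of what that duplication costs.  The byte 
> counts are made-up placeholders, not the exact 0.89 wire format, and it 
> assumes a single column family per row:
> {code:java}
> // Rough arithmetic only; sizes are hypothetical placeholders.
> public class WireOverheadSketch {
>   public static void main(String[] args) {
>     int rowKeyBytes = 32;    // hypothetical row key length
>     int familyBytes = 4;     // hypothetical family name length
>     int timestampBytes = 8;  // a long
>     int n = 100;             // edits (columns) for this one row
>
>     int perCopy = rowKeyBytes + familyBytes + timestampBytes;
>     int repeated = (n + 1) * perCopy;  // sent N+1 times today
>     int shared = perCopy;              // sent once, if shared per row
>     System.out.println("repeated: " + repeated + " bytes vs shared: "
>         + shared + " bytes (wasted: " + (repeated - shared) + ")");
>   }
> }
> {code}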

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
