[
https://issues.apache.org/jira/browse/HBASE-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HBASE-2898:
-------------------------
Fix Version/s: 0.90.0
Bringing into 0.90. Agree it's a blocker.
> MultiPut makes proper error handling impossible and leads to corrupted data
> ---------------------------------------------------------------------------
>
> Key: HBASE-2898
> URL: https://issues.apache.org/jira/browse/HBASE-2898
> Project: HBase
> Issue Type: Bug
> Components: client, regionserver
> Affects Versions: 0.89.20100621
> Reporter: Benoit Sigoure
> Priority: Blocker
> Fix For: 0.90.0
>
>
> The {{MultiPut}} RPC needs to be completely rewritten. Let's see why step by
> step.
> # An HBase user calls any of the {{put}} methods on an {{HTable}} instance.
> # Eventually, {{HTable#flushCommits}} is invoked to actually send the edits
> to the RegionServer(s).
> # This takes us to {{HConnectionManager#processBatchOfPuts}} where all edits
> are sorted into one or more {{MultiPut}}. Each {{MultiPut}} aggregates all
> the edits that are going to a particular RegionServer.
> # A thread pool is used to send all the {{MultiPut}} in parallel to their
> respective RegionServers. Let's follow what happens for a single {{MultiPut}}.
> # The {{MultiPut}} travels through the IPC code on the client and then
> through the network and then through the IPC code on the RegionServer.
> # We're now in {{HRegionServer#multiPut}} where a new {{MultiPutResponse}} is
> created.
> # Still in {{HRegionServer#multiPut}}. Since a {{MultiPut}} is essentially a
> map from region name to a list of {{Put}} for that region, there's a {{for}}
> loop that executes each list of {{Put}} for each region sequentially. Let's
> follow what happens for a single list of {{Put}} for a particular region.
> # We're now in {{HRegionServer#put(byte[], List<Put>)}}. Each {{Put}} is
> associated with the row lock that was specified by the client (if any). Then
> the pairs of {{(Put, lock id)}} are handed to the right {{HRegion}}.
> # Now we're in {{HRegion#put(Pair<Put, Integer>[])}}, which immediately takes
> us to {{HRegion#doMiniBatchPut}}.
> # At this point, let's assume that we're doing just 2 edits. So the
> {{BatchOperationInProgress}} that {{doMiniBatchPut}} operates on contains just
> 2 {{Put}}.
> # The {{while}} loop in {{doMiniBatchPut}} that's going to execute each
> {{Put}} starts.
> # The first {{Put}} succeeds normally, so at the end of the first iteration,
> its entry in {{batchOp.retCodes}} is marked as {{OperationStatusCode.SUCCESS}}.
> # The second {{Put}} fails because an exception is thrown when appending the
> edit to the {{WAL}}. Its entry in {{batchOp.retCodes}} is marked as
> {{OperationStatusCode.FAILURE}}.
> # {{doMiniBatchPut}} is done and returns to {{HRegion#put(Pair<Put,
> Integer>[])}}, which returns to {{HRegionServer#put(byte[], List<Put>)}}.
> # At this point, {{HRegionServer#put(byte[], List<Put>)}} does a {{for}} loop
> and extracts the *first* failure out of the {{OperationStatusCode[]}}. In
> our case, it'll take the {{OperationStatusCode.FAILURE}} that came from the
> second {{Put}}.
> # This error code is returned to {{HRegionServer#multiPut}}, which records in
> the {{MultiPutResponse}} that there was a failure - but *we don't know for
> which {{Put}}* (a toy sketch of this collapse follows right after the list).
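> To make the last two steps concrete, here is a toy sketch of where the
> per-{{Put}} information gets lost. This is *not* the actual
> {{HRegionServer}} code; the class name, the method and the region name
> below are made up, but the collapse it performs matches the behaviour
> described above.
> {code:java}
> // Toy model only; stand-ins, not the real HRegionServer/MultiPutResponse.
> import java.util.*;
>
> public class StatusCollapse {
>   enum OperationStatusCode { SUCCESS, FAILURE }
>
>   // Stand-in for HRegionServer#put(byte[], List<Put>): the full status array
>   // produced by HRegion#doMiniBatchPut is reduced to the index of the first
>   // non-SUCCESS entry (-1 if everything succeeded).
>   static int collapse(OperationStatusCode[] perPutStatus) {
>     for (int i = 0; i < perPutStatus.length; i++) {
>       if (perPutStatus[i] != OperationStatusCode.SUCCESS) {
>         return i;
>       }
>     }
>     return -1;
>   }
>
>   public static void main(String[] args) {
>     // Three Puts for one region: #0 succeeded, #1 failed on the WAL append,
>     // #2 succeeded.
>     OperationStatusCode[] statuses = { OperationStatusCode.SUCCESS,
>         OperationStatusCode.FAILURE, OperationStatusCode.SUCCESS };
>
>     // Stand-in for what the MultiPutResponse ends up carrying for this
>     // region: a single number. From "1" alone the client cannot tell that
>     // #0 and #2 were already applied.
>     Map<String, Integer> response = new HashMap<String, Integer>();
>     response.put("some-region", collapse(statuses));
>     System.out.println(response);   // {some-region=1}
>   }
> }
> {code}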
> So the client has no way of knowing which {{Put}} failed. All it knows is
> that at least one of them failed for a particular region. Its best bet is to
> retry all the {{Put}} for this region. But this has an unintended
> consequence. The {{Put}} that were successful during the first run will be
> *re-applied*. This will unexpectedly create extra versions. Now I realize
> most people don't really care about versions, so they won't notice. But
> whoever relies on the versions for whatever reason will rightfully consider
> this to be data corruption.
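> To see the "extra versions" effect in isolation, here is a purely
> illustrative snippet against the 0.90-era client API; the table, family,
> qualifier and row names are made up, and the second {{table.put(p)}} stands
> in for the blind retry the client is forced into.
> {code:java}
> // Illustration only: "t", "f", "q" and "row-1" are hypothetical names.
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.Get;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class ExtraVersions {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     HTable table = new HTable(conf, "t");
>
>     Put p = new Put(Bytes.toBytes("row-1"));
>     p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
>
>     table.put(p);       // first attempt: this edit was actually applied
>     Thread.sleep(10);   // a real retry happens later, so the timestamp differs
>     table.put(p);       // blind retry after an opaque MultiPut failure:
>                         // the same edit is re-applied as a *new* version
>
>     Get g = new Get(Bytes.toBytes("row-1"));
>     g.setMaxVersions();
>     Result r = table.get(g);
>     // A version-aware reader now sees 2 cells for f:q where it wrote only 1.
>     System.out.println("versions of f:q = " + r.raw().length);
>   }
> }
> {code}
> Anyone scanning with {{setMaxVersions()}}, or relying on version counts for
> bookkeeping, sees the duplicate - which is exactly the corruption described
> above.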
> {{MultiPut}} makes proper error handling impossible. Since this RPC cannot
> guarantee any atomicity other than at the individual {{Put}} level, it *must*
> return to the client specific information about which {{Put}} failed in case
> of a failure, so that the client can do proper error handling.
> This requires us to change the {{MultiPutResponse}} so that it can indicate
> which {{Put}} specifically failed. We could do this for instance by giving
> the index of the {{Put}} along with its {{OperationStatusCode}}. So in the
> scenario above, the {{MultiPutResponse}} would essentially return something
> like: "for that particular region, put #0 succeeded, put #1 failed". If we
> want to save a bit of space, we may want to omit the successes from the
> response and only mention the failures - so a response that doesn't mention
> any failure means that everything was successful. Not sure whether that's a
> good idea though.
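> To make the proposal concrete, here is one possible shape for such a
> response. This is only a sketch, not a patch: the class name, the
> {{String}} region key and the "failures only" layout are all up for
> discussion, but it carries the information the client needs.
> {code:java}
> // Hypothetical per-Put response; not the current MultiPutResponse.
> import java.util.*;
>
> public class PerPutMultiPutResponse {
>   public enum OperationStatusCode { SUCCESS, FAILURE, NOT_RUN }
>
>   // region name -> (index of the failed Put in the request -> its status).
>   // Only failures are recorded; an absent region or index means SUCCESS.
>   private final Map<String, Map<Integer, OperationStatusCode>> failures =
>       new HashMap<String, Map<Integer, OperationStatusCode>>();
>
>   public void addFailure(String regionName, int putIndex,
>                          OperationStatusCode code) {
>     Map<Integer, OperationStatusCode> perRegion = failures.get(regionName);
>     if (perRegion == null) {
>       perRegion = new TreeMap<Integer, OperationStatusCode>();
>       failures.put(regionName, perRegion);
>     }
>     perRegion.put(putIndex, code);
>   }
>
>   /** Indexes the client should retry for this region; empty means all good. */
>   public Set<Integer> failedPuts(String regionName) {
>     Map<Integer, OperationStatusCode> perRegion = failures.get(regionName);
>     return perRegion == null
>         ? Collections.<Integer>emptySet() : perRegion.keySet();
>   }
> }
> {code}
> With something along these lines, the scenario above comes back as "put #1
> failed" for that region, and the client's retry loop re-queues only index 1
> instead of the whole list.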
> Since doing this requires an incompatible RPC change, I propose that we take
> the opportunity to rewrite the {{MultiPut}} RPC too. Right now it's very
> inefficient; it's just a hack on top of {{Put}}. When {{MultiPut}} is
> written to the wire, a lot of unnecessary duplicate data is sent out. The
> timestamp, the row key and the family are sent over the wire N+1 times,
> where N is the number of edits for a particular row, instead of just once (!).
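> Purely for illustration, here is what a non-duplicating layout for a single
> row's edits could look like. This is *not* the current HBase wire format and
> the framing below is made up; it only shows that the row key, family and
> timestamp need to appear once, followed by the N qualifier/value pairs.
> {code:java}
> // Hypothetical compact layout for one row's edits; not the HBase wire format.
> import java.io.*;
> import java.util.*;
>
> public class CompactRowEncoding {
>   static void writeRow(DataOutput out, byte[] row, byte[] family, long ts,
>                        Map<byte[], byte[]> qualifierToValue) throws IOException {
>     out.writeInt(row.length);    out.write(row);     // row key written once
>     out.writeInt(family.length); out.write(family);  // family written once
>     out.writeLong(ts);                               // timestamp written once
>     out.writeInt(qualifierToValue.size());           // N edits follow
>     for (Map.Entry<byte[], byte[]> e : qualifierToValue.entrySet()) {
>       out.writeInt(e.getKey().length);   out.write(e.getKey());
>       out.writeInt(e.getValue().length); out.write(e.getValue());
>     }
>   }
>
>   public static void main(String[] args) throws IOException {
>     ByteArrayOutputStream buf = new ByteArrayOutputStream();
>     Map<byte[], byte[]> edits = new LinkedHashMap<byte[], byte[]>();
>     edits.put("q1".getBytes(), "v1".getBytes());
>     edits.put("q2".getBytes(), "v2".getBytes());
>     writeRow(new DataOutputStream(buf), "row-1".getBytes(),
>              "family".getBytes(), System.currentTimeMillis(), edits);
>     System.out.println("compact encoding of 2 edits: " + buf.size() + " bytes");
>   }
> }
> {code}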