MultiPut makes proper error handling impossible and leads to corrupted data
---------------------------------------------------------------------------

                 Key: HBASE-2898
                 URL: https://issues.apache.org/jira/browse/HBASE-2898
             Project: HBase
          Issue Type: Bug
          Components: client, regionserver
    Affects Versions: 0.89.20100621
            Reporter: Benoit Sigoure
            Priority: Blocker


The {{MultiPut}} RPC needs to be completely rewritten.  Let's see why step by 
step.

# An HBase user calls any of the {{put}} methods on an {{HTable}} instance.
# Eventually, {{HTable#flushCommits}} is invoked to actually send the edits to 
the RegionServer(s).
# This takes us to {{HConnectionManager#processBatchOfPuts}} where all edits 
are sorted into one or more {{MultiPut}}.  Each {{MultiPut}} is aggregating all 
the edits that are going to a particular RegionServer.
# A thread pool is used to send all the {{MultiPut}} in parallel to their 
respective RegionServer.  Let's follow what happens for a single {{MultiPut}}.
# The {{MultiPut}} travels through the IPC code on the client and then through 
the network and then through the IPC code on the RegionServer.
# We're now in {{HRegionServer#multiPut}} where a new {{MultiPutResponse}} is 
created.
# Still in {{HRegionServer#multiPut}}.  Since a {{MultiPut}} is essentially a 
map from region name to a list of {{Put}} for that region, there's a {{for}} 
loop that executes each list of {{Put}} for each region sequentially.  Let's 
follow what happens for a single list of {{Put}} for a particular region.
# We're now in {{HRegionServer#put(byte[], List<Put>)}}.  Each {{Put}} is 
associated with the row lock that was specified by the client (if any).  Then 
the pairs of {{(Put, lock id)}} are handed to the right {{HRegion}}.
# Now we're in {{HRegion#put(Pair<Put, Integer>[])}}, which immediately takes 
us to {{HRegion#doMiniBatchPut}}.
# At this point, let's assume that we're doing just 2 edits.  So the 
{{BatchOperationInProgress}} that {{doMiniBatchPut}} has contains just 2 
{{Put}}.
# The {{while}} loop in {{doMiniBatchPut}} that's going to execute each {{Put}} 
starts.
# The first {{Put}} succeeds normally, so at the end of the first iteration, 
its {{batchOp.retCodes}} is marked as {{OperationStatusCode.SUCCESS}}.
# The second {{Put}} fails because an exception is thrown when appending the 
edit to the {{WAL}}.  Its {{batchOp.retCodes}} is marked as 
{{OperationStatusCode.FAILURE}}.
# {{doMiniBatchPut}} is done and returns to {{HRegion#put(Pair<Put, 
Integer>[])}}, which returns to {{HRegionServer#put(byte[], List<Put>)}}.
# At this point, {{HRegionServer#put(byte[], List<Put>)}} does a {{for}} loop 
and extracts the *first* failure out of the {{OperationStatusCode[]}}.  In our 
case, it'll take the {{OperationStatusCode.FAILURE}} that came from the second 
{{Put}}.
# This error code is returned to {{HRegionServer#multiPut}}, which records in 
the {{MultiPutResponse}} that there was a failure - but *we don't know for 
which {{Put}}*

So the client has no way of knowing which {{Put}} failed.  All it knows is that 
at least one of them failed for a particular region.  Its best bet is to retry 
all the {{Put}} for this region.  But this has an unintended consequence.  The 
{{Put}} that were successful during the first run will be *re-applied*.  This 
will unexpectedly create extra versions.  Now I realize most people don't 
really care about versions, so they won't notice.  But whoever relies on the 
versions for whatever reason will rightfully consider this to be data 
corruption.

{{MultiPut}} makes proper error handling impossible.  Since this RPC cannot 
guarantee any atomicity other than at the individual {{Put}} level, it *must* 
return to the client specific information about which {{Put}} failed in case of 
a failure, so that the client can do proper error handling.

This requires us to change the {{MultiPutResponse}} so that it can indicate 
which {{Put}} specifically failed.  We could do this for instance by giving the 
index of the {{Put}} along with its {{OperationStatusCode}}.  So in the 
scenario above, the {{MultiPutResponse}} would essentially return something 
like: "for that particular region, put #0 succeeded, put #1 failed".  If we 
want to save a bit of space, we may want to omit the successes from the 
response and only mention the failures - so a response that doesn't mention any 
failure means that everything was successful.  Not sure whether that's a good 
idea though.

Since doing this require an incompatible RPC change, I propose that we take the 
opportunity to rewrite the {{MultiPut}} RPC too.  Right now it's very 
inefficient, it's just a hack on top of {{Put}}.  When {{MultiPut}} is written 
to the wire, a lot of unnecessary duplicate data is sent out.  The timestamp, 
the row key and the family are sent out to the wire N+1 times, where N is the 
number of edits for a particular row, instead of just once (!).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to