John Vines created ACCUMULO-4542:
------------------------------------

             Summary: Tablet left in bad state after bulk import timeout
                 Key: ACCUMULO-4542
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4542
             Project: Accumulo
          Issue Type: Bug
    Affects Versions: 1.7.2
            Reporter: John Vines


On a cluster we saw a large number of network issues at one point. The cause 
still has not been pinpointed, but it resulted in a lot of rpc exceptions 
and the like.

While these network issues were happening, a bulk import was kicked off for a 
single file. This file was assigned to two tablets (which both happened to be 
on the same server). Unfortunately, each of the 3 attempts bulk import made to 
assign this file to this tablet failed with an rpc exception due to a socket 
timeout. After the 3 failures, the bulk import moved the file to the failures 
directory and carried on.

Unfortunately, this file was actually assigned to the tablet successfully on 
the first attempt. The following 2 attempts logged that the server had already 
been assigned this file. Shortly afterward a query came in (followed later by 
major compactions), and both complained that the file could not be found 
because the bulk import had moved it to the failures directory.

I think in this case we need some sort of final validation that the record did 
not end up in the metadata table before we move the file to the failures 
directory.
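
To make the proposed final validation concrete, here is a minimal sketch in 
Java. It is not Accumulo code: the class BulkImportFinalCheck, the Decision 
enum, and the isReferenced lookup are all hypothetical names invented for 
illustration, and the lookup stands in for whatever metadata table read the 
real fix would use. The idea is only that a file whose assignment rpcs all 
failed is moved to the failures directory if, and only if, no tablet it was 
assigned to references it.

```java
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical sketch; none of these names are Accumulo APIs.
final class BulkImportFinalCheck {

    /** Outcome of the final check for one file. */
    enum Decision { KEEP_IN_PLACE, MOVE_TO_FAILURES }

    /**
     * Decide what to do with a file whose assignment rpcs all failed.
     *
     * @param file            path of the bulk-imported file
     * @param assignedTablets tablets the file was supposed to be assigned to
     * @param isReferenced    assumed metadata lookup: (tablet, file) -> true
     *                        if that tablet's metadata entry lists the file
     */
    static Decision decide(String file, List<String> assignedTablets,
                           BiPredicate<String, String> isReferenced) {
        for (String tablet : assignedTablets) {
            if (isReferenced.test(tablet, file)) {
                // A tablet took the file even though the rpc timed out, so
                // moving the file would leave a dangling reference.
                return Decision.KEEP_IN_PLACE;
            }
        }
        // No tablet references the file, so it is safe to move it.
        return Decision.MOVE_TO_FAILURES;
    }
}
```

One caveat with this sketch: during the kind of network trouble described 
above, the metadata lookup itself could fail, so a real check would probably 
need to treat a failed lookup as "unknown" and leave the file in place rather 
than move it.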



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
