[jira] [Commented] (ACCUMULO-4542) Tablet left in bad state after bulk import timeout

Christopher Tubbs (JIRA) Tue, 03 Jan 2017 14:54:13 -0800

    [ https://issues.apache.org/jira/browse/ACCUMULO-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796461#comment-15796461 ]


Christopher Tubbs commented on ACCUMULO-4542:
---------------------------------------------

This seems really hard to reproduce. [~kturner] tells me he believes there is a 
final check before the file is moved, and that it might do a copy instead of a 
move if assignment has failed for some tablets but not others (in the case of a 
file overlapping several tablets). If he's right, then it's possible there was 
a failure reading the metadata table to confirm, and the system treated this 
failure to validate as a false-positive failure to assign. I'm not sure there's 
a sane way to handle that case that would be better than the result you saw.
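
To make the move-versus-copy distinction above concrete, here is a minimal sketch 
of how such a final check could choose an action. Every name in it 
(BulkFailureDecision, AssignmentState, Action, decide) is hypothetical and is not 
taken from the Accumulo code base. Leaving the file in place when a metadata read 
fails is an assumption made for illustration, not a confirmed behavior or an 
agreed fix.

```java
import java.util.Collection;

// Hypothetical sketch; none of these types exist in Accumulo.
class BulkFailureDecision {
  // Result of checking the metadata table for one tablet.
  enum AssignmentState { ASSIGNED, NOT_ASSIGNED, UNKNOWN }

  // What to do with the bulk import file once the retries are exhausted.
  enum Action { MOVE_TO_FAILURES, COPY_TO_FAILURES, LEAVE_IN_PLACE }

  static Action decide(Collection<AssignmentState> perTablet) {
    boolean anyAssigned = false;
    boolean anyNotAssigned = false;
    for (AssignmentState state : perTablet) {
      if (state == AssignmentState.UNKNOWN) {
        // Assumption: a failed metadata read is not proof of a failed
        // assignment, so the file is not moved out from under a tablet.
        return Action.LEAVE_IN_PLACE;
      }
      if (state == AssignmentState.ASSIGNED) {
        anyAssigned = true;
      } else {
        anyNotAssigned = true;
      }
    }
    if (!anyNotAssigned) {
      // Every tablet references the file (or there were no tablets).
      return Action.LEAVE_IN_PLACE;
    }
    // Some tablets already reference the file: copy so they keep working.
    // No tablet references it: moving is safe.
    return anyAssigned ? Action.COPY_TO_FAILURES : Action.MOVE_TO_FAILURES;
  }
}
```

The point of the third state is that "could not validate" and "not assigned" are 
kept apart, which is the confusion suspected above.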

> Tablet left in bad state after bulk import timeout
> --------------------------------------------------
>
>                 Key: ACCUMULO-4542
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4542
>             Project: Accumulo
>          Issue Type: Bug
>    Affects Versions: 1.7.2
>            Reporter: John Vines
>
> On a cluster we saw a large number of network issues at one point. The cause 
> still has not been pinpointed, but it did result in us seeing a lot of RPC 
> exceptions and the like.
> While these network issues happened, a bulk import was kicked off for a 
> single file. This single file was assigned to two tablets (which both 
> happened to be on the same server). Unfortunately, in the 3 attempts bulk 
> import made to assign this file to this tablet, there were 3 rpc exceptions 
> due to a socket timeout. After the three failures the bulk import went ahead 
> and moved this file to the failures directory and carried on.
> Unfortunately, this file was actually assigned to the tablet successfully on 
> the first attempt. The following 2 attempts logged that the server had 
> already been assigned this file. Shortly afterward, a query came in (and 
> later, major compactions), which complained that the file could not be found 
> because the bulk import had moved it to the failures directory.
> I think in this event we need some sort of final validation that the record 
> didn't end up in the metadata table before we move the file to the failures 
> directory.
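
The scenario described above, and the final validation it asks for, can be 
sketched as a small self-contained simulation. This is not Accumulo code: the 
class BulkImportTimeoutSketch, its in-memory "metadata" map, and the tablet and 
file names used with it are all hypothetical stand-ins. It assumes the 
assignment takes effect on the server while the reply is lost to a socket 
timeout, as reported.

```java
import java.net.SocketTimeoutException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical simulation; these names are illustrative, not Accumulo APIs.
class BulkImportTimeoutSketch {
  // Stand-in for the metadata table: tablet -> files it references.
  final Map<String, Set<String>> metadata = new HashMap<>();

  // Simulates an assignment RPC that succeeds on the server but whose
  // reply never reaches the client.
  void assign(String tablet, String file) throws SocketTimeoutException {
    metadata.computeIfAbsent(tablet, t -> new HashSet<>()).add(file);
    throw new SocketTimeoutException("simulated lost reply");
  }

  // Final validation: does the tablet already reference the file?
  boolean isReferenced(String tablet, String file) {
    Set<String> files = metadata.get(tablet);
    return files != null && files.contains(file);
  }

  // Returns true only if the file should be moved to the failures directory.
  boolean shouldMoveToFailures(String tablet, String file, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        assign(tablet, file);
        return false; // clean success, nothing to move
      } catch (SocketTimeoutException e) {
        // Outcome unknown: the server may or may not have applied it.
      }
    }
    // All attempts timed out; consult the metadata before giving up.
    return !isReferenced(tablet, file);
  }
}
```

With three attempts that all time out, shouldMoveToFailures returns false here 
because the first attempt already recorded the file, so the file would stay 
where the tablet expects it.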



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
