[
https://issues.apache.org/jira/browse/ACCUMULO-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876149#comment-14876149
]
Eric Newton commented on ACCUMULO-4000:
---------------------------------------
More information:
* the decommissioned nodes were still running tservers, so they most likely had
WALs open on the local datanodes
* this particular system primarily uses bulk loading, so the WALs don't roll
for days
* the decommissioned servers refused to shut down, so the SAs killed them
(yikes!)
In response to my own questions:
* are blocks open for writing re-replicated?
** no, the normal write pipeline provides some safety, but decommissioning does
not re-replicate the open blocks (see the block-location sketch after this list)
* if a WAL isn't being used, and all the datanodes in the pipeline go away, will
the tserver get an error?
** no, it won't
* will a datanode finish decommissioning if there are still open writers?
** in this case the datanodes were helped along, but it seems that they won't
on their own
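
Since decommissioning leaves the open blocks where they are, one quick way to see
the exposure is to dump where a given WAL's blocks actually live and compare that
against the decommission list. A minimal sketch using only the stock Hadoop
FileSystem API (nothing Accumulo-specific; the WAL path is passed in, and the
block currently being written may not be reported at all):

{code:java}
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Print the datanodes holding each block of the given WAL file, so they can be
// compared against the decommission list.
public class WalBlockLocations {
  public static void main(String[] args) throws Exception {
    Path wal = new Path(args[0]);
    FileSystem fs = wal.getFileSystem(new Configuration());
    FileStatus stat = fs.getFileStatus(wal);
    for (BlockLocation loc : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
      System.out.println("offset " + loc.getOffset() + " len " + loc.getLength()
          + " -> " + Arrays.toString(loc.getHosts()));
    }
  }
}
{code}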
We may need to create a utility to request that tservers cycle their WALs, so
that the current files are closed and replicated and the decommission can
complete.
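
For the checking side of such a utility, HDFS already exposes enough to tell
whether any WALs are still open for writing. A rough sketch; the /accumulo/wal
root and the one-directory-per-tserver layout are assumptions of mine, not part
of the proposed utility. From the command line, hdfs fsck /accumulo/wal
-openforwrite gives roughly the same answer.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// List WAL files that still have an open writer, i.e. the ones a tserver would
// need to cycle before its datanode can finish decommissioning.
public class OpenWalReport {
  public static void main(String[] args) throws IOException {
    Path walRoot = new Path(args.length > 0 ? args[0] : "/accumulo/wal");
    DistributedFileSystem dfs =
        (DistributedFileSystem) walRoot.getFileSystem(new Configuration());
    for (FileStatus server : dfs.listStatus(walRoot)) {   // one directory per tserver
      for (FileStatus wal : dfs.listStatus(server.getPath())) {
        if (!dfs.isFileClosed(wal.getPath())) {           // lease still held by a writer
          System.out.println("still open: " + wal.getPath());
        }
      }
    }
  }
}
{code}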
> log recovery failed after hard reset
> ------------------------------------
>
> Key: ACCUMULO-4000
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4000
> Project: Accumulo
> Issue Type: Bug
> Affects Versions: 1.6.2
> Environment: very large cluster, accumulo 1.6.2, hadoop 2.5.0 (cdh
> 5.3)
> Reporter: Eric Newton
> Assignee: Eric Newton
>
> Had a hardware failure on a single node within a large cluster. Tablets were
> migrated away, but one tablet would not recover. The Closer run by the
> master to release the write lease on the WAL failed repeatedly.
> Afterwards, it was determined the file was small, probably just opened and
> used at the moment the machine failed. The block could not be recovered from
> any replicas.
> One question raised: does the write pipeline acknowledge the sync before the
> write pipeline completes?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)