[ https://issues.apache.org/jira/browse/ACCUMULO-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623132#comment-14623132 ]
Josh Elser commented on ACCUMULO-3937:
--------------------------------------
I also increased this from 5 to 15 because we only sleep for 100ms between
attempts. We should likely have an increasing backoff during failures to
trigger this more reliably.
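
For illustration only, an increasing backoff between attempts could look roughly like the sketch below. This is not the Accumulo code: the class BackoffRetry, its method names, and all parameters are hypothetical, and the doubling-with-a-cap policy is just one common choice.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Accumulo: retries an action and sleeps
// for a doubling, capped interval between failed attempts.
final class BackoffRetry {
  private BackoffRetry() {}

  // Sleep to use after failed attempt number `attempt` (0-based):
  // initialMs * 2^attempt, never more than maxMs.
  static long delayMillis(int attempt, long initialMs, long maxMs) {
    long delay = Math.min(initialMs, maxMs);
    for (int i = 0; i < attempt; i++) {
      if (delay > maxMs / 2) {
        return maxMs; // doubling again would exceed the cap
      }
      delay *= 2;
    }
    return delay;
  }

  // Runs `action` up to maxAttempts times and rethrows the last failure
  // if every attempt fails.
  static <T> T call(Callable<T> action, int maxAttempts, long initialMs, long maxMs)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    Exception last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return action.call();
      } catch (Exception e) {
        last = e;
        if (attempt + 1 < maxAttempts) {
          Thread.sleep(delayMillis(attempt, initialMs, maxMs));
        }
      }
    }
    throw last;
  }
}
```

With an initial sleep of 100ms and a cap of 5000ms, the waits in this sketch grow as 100, 200, 400, 800, 1600, 3200 and then stay at 5000, so a fixed number of attempts spans a much longer outage than a constant 100ms sleep would.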
> Hard-coded HDFS failure tolerance
> ---------------------------------
>
> Key: ACCUMULO-3937
> URL: https://issues.apache.org/jira/browse/ACCUMULO-3937
> Project: Accumulo
> Issue Type: Bug
> Components: tserver
> Affects Versions: 1.7.0
> Reporter: Josh Elser
> Assignee: Josh Elser
> Priority: Blocker
> Fix For: 1.7.1, 1.8.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> ACCUMULO-2480 added an error cache to the TabletServer which makes the
> tserver kill itself after 5 errors creating a new WAL file within 10 seconds.
> This is painful because it now causes Accumulo to kill itself if HDFS is
> restarted beneath Accumulo. Previously, I would have expected Accumulo to
> just keep on chugging if HDFS goes away. Now, I'll have to restart it when
> HDFS returns.
> This should be a configuration property instead of being hard-coded.
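
As a rough sketch of what the description asks for, the failure tolerance could be expressed as a sliding window whose limits are passed in rather than hard-coded. The class FailureWindow and its parameters are hypothetical and are not taken from the TabletServer code; in a real change the two values would presumably be read from configuration properties.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, not the TabletServer implementation: tracks recent
// failures and reports when a configurable limit is reached in a time window.
final class FailureWindow {
  private final int maxFailures;
  private final long windowMillis;
  private final Deque<Long> failureTimes = new ArrayDeque<>();

  FailureWindow(int maxFailures, long windowMillis) {
    if (maxFailures < 1 || windowMillis < 0) {
      throw new IllegalArgumentException("invalid failure window settings");
    }
    this.maxFailures = maxFailures;
    this.windowMillis = windowMillis;
  }

  // Records a failure observed at nowMillis. Returns true once at least
  // maxFailures failures fall within the last windowMillis milliseconds.
  synchronized boolean recordFailure(long nowMillis) {
    failureTimes.addLast(nowMillis);
    // Drop failures that are older than the window.
    while (!failureTimes.isEmpty() && nowMillis - failureTimes.peekFirst() > windowMillis) {
      failureTimes.removeFirst();
    }
    return failureTimes.size() >= maxFailures;
  }
}
```

Constructing it as new FailureWindow(5, 10_000) mirrors the 5-errors-in-10-seconds behavior described above, while larger values would let the server ride out an HDFS restart.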