[
https://issues.apache.org/jira/browse/ACCUMULO-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16306395#comment-16306395
]
Jeff Schmidt commented on ACCUMULO-4615:
----------------------------------------
Hi! I just wanted to ping on this ticket because our project hit a
ConcurrentModificationException in the mentioned code while the Master was
attempting to balance, which caused several "Error balancing tablets, will wait
for 1 (seconds) and then retry" messages.
In Master#StatusThread.updateStatus(), we do:
{code}
tserverStatus =
Collections.synchronizedSortedMap(gatherTableInformation(currentServers));
{code}
The resulting map is used by the balancer. However, even though it's wrapped
in a synchronized map and an unmodifiable map, the inner map (the one returned
by gatherTableInformation()) can still be modified concurrently because of the
way we shut down the status thread pool.
In StatusThread.gatherTableInformation(), we do:
{code}
tp.shutdown();
try {
  tp.awaitTermination(....);
} catch (InterruptedException e) {
  log.debug(...);
}
{code}
The shutdown request allows already-submitted tasks to continue but prevents
new tasks from being submitted. So if awaitTermination() times out,
slow-reporting tservers can still modify this map after the method returns,
leading to the ConcurrentModificationException.
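To illustrate one possible direction, here is a minimal, self-contained sketch.
It is not the actual Master code: the class StatusGatherSketch, the gather()
method, and the map of per-tserver delays are hypothetical stand-ins for
gatherTableInformation() and the status RPCs. The idea is to interrupt
stragglers with shutdownNow() once the timeout expires, and to hand the caller a
copy of the results rather than the map the worker threads write into.

```java
import java.util.Collections;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names here are illustrative and not taken from Accumulo.
class StatusGatherSketch {

  // Simulates gathering status from several "tservers". Each entry maps a
  // server name to the time (ms) it takes to report.
  static SortedMap<String,Long> gather(SortedMap<String,Long> delaysMillis, long timeoutMillis) {
    final SortedMap<String,Long> result = new TreeMap<>();
    ExecutorService tp = Executors.newFixedThreadPool(Math.max(1, delaysMillis.size()));

    for (Map.Entry<String,Long> e : delaysMillis.entrySet()) {
      final String name = e.getKey();
      final long delay = e.getValue();
      tp.execute(() -> {
        try {
          Thread.sleep(delay); // stands in for a slow status request
        } catch (InterruptedException ex) {
          return; // interrupted by shutdownNow(): drop this report
        }
        synchronized (result) {
          result.put(name, delay);
        }
      });
    }

    tp.shutdown(); // no new tasks; already-submitted tasks keep running
    try {
      if (!tp.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS)) {
        tp.shutdownNow(); // timed out: interrupt the stragglers
      }
    } catch (InterruptedException ex) {
      tp.shutdownNow();
      Thread.currentThread().interrupt();
    }

    // Return a snapshot taken under the same lock the workers use, so a task
    // that finishes late can never touch the map the caller iterates over.
    synchronized (result) {
      return Collections.unmodifiableSortedMap(new TreeMap<>(result));
    }
  }

  public static void main(String[] args) {
    SortedMap<String,Long> delays = new TreeMap<>();
    delays.put("fast", 10L);
    delays.put("slow", 5000L);
    System.out.println(gather(delays, 500L).keySet());
  }
}
```

The snapshot copy is what actually protects the caller; shutdownNow() only
shortens how long the stragglers keep running, since it relies on the tasks
responding to interruption.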
I'd be happy to work on a fix if no one has started tackling this yet.
> ThreadPool timeout when checking tserver stats is confusing
> -----------------------------------------------------------
>
> Key: ACCUMULO-4615
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4615
> Project: Accumulo
> Issue Type: Bug
> Components: master
> Affects Versions: 1.8.1
> Reporter: Michael Wall
> Priority: Minor
> Fix For: 1.8.2, 2.0.0
>
>
> If it takes longer than the configured time to gather information from all
> the tablet servers, the thread pool stops and processing continues with
> whatever has been collected. Code is
> https://github.com/apache/accumulo/blob/1.8/server/master/src/main/java/org/apache/accumulo/master/Master.java#L1120,
> default timeout is 6s. Does not appear to be an issue prior to 1.8.
> Best case, this was really confusing. The monitor page would have 30
> tservers, then 5 tservers. Didn't really see any other negative effects, no
> migrations and no balancing appeared to be affected. Worst case, though, I
> missed something and the master is making decisions based on incomplete
> information.
> [[email protected]] please add more info if needed.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)