[
https://issues.apache.org/jira/browse/ACCUMULO-315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195284#comment-13195284
]
Keith Turner commented on ACCUMULO-315:
---------------------------------------
With the most recent changes against this ticket, including the bug fixes I
just committed, I am seeing merge operations get stuck during the random walk
test.
> Hole in metadata table occurred during random walk test
> -------------------------------------------------------
>
> Key: ACCUMULO-315
> URL: https://issues.apache.org/jira/browse/ACCUMULO-315
> Project: Accumulo
> Issue Type: Bug
> Components: master, tserver
> Environment: Running 1.4.0 SNAPSHOT on 10 node cluster.
> Reporter: Keith Turner
> Assignee: Keith Turner
> Priority: Critical
> Fix For: 1.4.0
>
>
> While running the random walk test, a hole occurred in the metadata table. A
> client tried to delete the table with the hole and the FATE operation got
> stuck. The master logs continually showed the following.
> {noformat}
> 14 00:02:11,273 [tableOps.CleanUp] DEBUG: Still waiting for table to be
> deleted: 4ct locationState:
> 4ct;4d2d3be2823b0bf4;27b693c626c2d4ef@(null,xxx.xxx.xxx.xxx:9997[134d7425fc503e1],null)
> {noformat}
> The metadata table contained the following. Tablet 4ct;4d2d3be2823b0bf4 had
> a location.
> {noformat}
> 4ct;262249211a62cd6f ~tab:~pr [] \x011819e56edae21302
> 4ct;27b693c626c2d4ef ~tab:~pr [] \x01262249211a62cd6f
> 4ct;43422047c78fa52b ~tab:~pr [] \x0141ea825af0f262d9
> 4ct;4d2d3be2823b0bf4 ~tab:~pr [] \x0127b693c626c2d4ef
> 4ct;4f89df61392bb311 ~tab:~pr [] \x014d2d3be2823b0bf4
> {noformat}
> Found the following events on a tablet server.
> {noformat}
> #the tablet server events below are caused by the delete range operation
> 13 21:36:04,287 [tabletserver.Tablet] TABLET_HIST:
> 4ct;4d2d3be2823b0bf4;262249211a62cd6f split
> 4ct;27b693c626c2d4ef;262249211a62cd6f 4ct;4d2d3be2823b0bf4;27b693c626c2d4ef
> 13 21:36:04,369 [tabletserver.Tablet] TABLET_HIST:
> 4ct;4d2d3be2823b0bf4;27b693c626c2d4ef split
> 4ct;41ea825af0f262d9;27b693c626c2d4ef 4ct;4d2d3be2823b0bf4;41ea825af0f262d9
> 13 21:36:04,370 [tabletserver.Tablet] TABLET_HIST:
> 4ct;4d2d3be2823b0bf4;41ea825af0f262d9 opened
> 13 21:36:06,141 [tabletserver.Tablet] TABLET_HIST:
> 4ct;4d2d3be2823b0bf4;41ea825af0f262d9 closed
> 13 21:36:06,142 [tabletserver.Tablet] DEBUG: Files for low split
> 4ct;43422047c78fa52b;41ea825af0f262d9 [/t-0001cdi/F0001bmw.rf,
> /t-0001cdi/F0001bn1.rf]
> 13 21:36:06,142 [tabletserver.Tablet] DEBUG: Files for high split
> 4ct;4d2d3be2823b0bf4;43422047c78fa52b [/t-0001cdi/A0001cef.rf,
> /t-0001cdi/F0001bmw.rf, /t-0001cdi/F0001bn1.rf]
> #split from other random walker
> 13 21:36:06,351 [tabletserver.Tablet] TABLET_HIST:
> 4ct;4d2d3be2823b0bf4;41ea825af0f262d9 split
> 4ct;43422047c78fa52b;41ea825af0f262d9 4ct;4d2d3be2823b0bf4;43422047c78fa52b
> {noformat}
> The following events occurred on the master and overlap in time with the
> split on the tablet server.
> {noformat}
> 13 21:36:06,312 [master.EventCoordinator] INFO : Merge state of
> 4ct;41ea825af0f262d9;27b693c626c2d4ef set to MERGING
> 13 21:36:06,312 [master.Master] DEBUG: Deleting tablets for
> 4ct;41ea825af0f262d9;27b693c626c2d4ef
> 13 21:36:06,316 [master.Master] DEBUG: Found following tablet
> 4ct;4d2d3be2823b0bf4;43422047c78fa52b
> 13 21:36:06,317 [master.Master] DEBUG: Making file deletion entries for
> 4ct;41ea825af0f262d9;27b693c626c2d4ef
> 13 21:36:06,325 [master.Master] DEBUG: Removing metadata table entries in
> range [4ct;27b693c626c2d4ef%00; : [] 9223372036854775807
> false,4ct;41ea825af0f262d9%00; : [] 9223372036854775807 false)
> 13 21:36:06,331 [master.Master] DEBUG: Updating prevRow of
> 4ct;4d2d3be2823b0bf4;43422047c78fa52b to 27b693c626c2d4ef
> {noformat}
> After many hours of debugging, Eric and I figured out what was going on. Two
> random walkers were running the concurrent test. One client initiated a
> delete range on table id 4ct for the range 27b693c626c2d4ef to
> 41ea825af0f262d9. While this delete range operation was occurring, another
> client added the split point 43422047c78fa52b. The master read the metadata
> table while the split was occurring and got inconsistent/incomplete
> information about what tablets related to the delete range operation were
> online. It assumed the required tablets were offline when they were not.
> The log messages above show that the split and updating of the prevRow by the
> master overlap in time.
> We think the best solution is to ensure that scans of the metadata table for
> merges and delete range are consistent with respect to end row and prev end
> row matching. Tablets cannot be considered individually; the portion of the
> metadata table under consideration must form a proper sorted linked list.
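
As a rough illustration of the sorted-linked-list check described above (a
sketch only, not the fix committed for this ticket), the following Python
treats each metadata entry as an (end row, prev end row) pair and reports
every tablet whose recorded prev end row does not match the end row of the
tablet sorted immediately before it. The function name find_chain_breaks is
hypothetical, and the sketch ignores the first and last tablets of a table.
The sample rows are the ~tab:~pr entries quoted in the description.

```python
from typing import List, Optional, Tuple

# (end_row, prev_end_row) pairs, mirroring the ~tab:~pr listing in the ticket.
Tablet = Tuple[str, Optional[str]]


def find_chain_breaks(tablets: List[Tablet]) -> List[Tuple[str, Optional[str], str]]:
    """Return (end_row, recorded_prev, expected_prev) for each broken link.

    Hypothetical helper: a link is broken when a tablet's recorded prev end
    row differs from the end row of the tablet sorted directly before it.
    """
    ordered = sorted(tablets, key=lambda t: t[0])
    breaks = []
    for (expected_prev, _), (end_row, recorded_prev) in zip(ordered, ordered[1:]):
        if recorded_prev != expected_prev:
            breaks.append((end_row, recorded_prev, expected_prev))
    return breaks


# Entries for table 4ct taken from the metadata listing in the description.
rows = [
    ("262249211a62cd6f", "1819e56edae21302"),
    ("27b693c626c2d4ef", "262249211a62cd6f"),
    ("43422047c78fa52b", "41ea825af0f262d9"),
    ("4d2d3be2823b0bf4", "27b693c626c2d4ef"),
    ("4f89df61392bb311", "4d2d3be2823b0bf4"),
]

for end_row, recorded, expected in find_chain_breaks(rows):
    print(f"{end_row}: prev end row is {recorded}, expected {expected}")
```

On the quoted listing this flags two tablets, 43422047c78fa52b and
4d2d3be2823b0bf4, which are the two entries on either side of the hole. A
scan that applied a check of this kind to the whole range could reject the
inconsistent read and retry instead of acting on it.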