[ 
https://issues.apache.org/jira/browse/ACCUMULO-315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188015#comment-13188015
 ] 

Eric Newton commented on ACCUMULO-315:
--------------------------------------

The master was performing a range-delete.  It split a tablet into three 
sections so that the center section could be removed with tablet operations.

While it was working through the online->chop->offline state transition, the 
last of the three tablets split.  This caused the main loop to miss a tablet 
and to end up with bad counts.  The master then mistakenly believed that every 
tablet that needed to be offline had been taken offline, and it updated the 
prevRow of the last tablet while that tablet was still online, which caused 
the hole in the metadata table.
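
For anyone who wants to see the damage in the metadata listing from the issue 
description, here is a minimal sketch of a prevRow chain check.  It is not 
Accumulo code: the function name check_prev_row_chain and the ROWS list are 
made up for illustration, the rows are copied from the listing in the issue, 
and dropping the leading \x01 from each ~tab:~pr value is an assumption about 
how that value is encoded.  The idea is only that each tablet's prevRow should 
equal the end row of the tablet sorted immediately before it.

```python
from typing import List, Tuple


def check_prev_row_chain(tablets: List[Tuple[str, str]]) -> List[Tuple[str, str, str, str]]:
    """Compare each tablet's prevRow with the end row of the tablet before it.

    `tablets` holds (end_row, prev_row) pairs for one table. Returns a list of
    (kind, end_row, expected_prev_row, found_prev_row) for every mismatch.
    The first tablet is not checked, because the listing may be an excerpt.
    """
    problems = []
    ordered = sorted(tablets)  # end rows are unique, so this sorts by end row
    for (prev_end, _), (end_row, prev_row) in zip(ordered, ordered[1:]):
        if prev_row == prev_end:
            continue
        # prevRow past the previous end row leaves a gap; before it, an overlap.
        kind = "hole" if prev_row > prev_end else "overlap"
        problems.append((kind, end_row, prev_end, prev_row))
    return problems


# (end_row, prev_row) pairs copied from the metadata listing in the issue,
# with the table id "4ct;" and the assumed "\x01" prefix stripped.
ROWS = [
    ("262249211a62cd6f", "1819e56edae21302"),
    ("27b693c626c2d4ef", "262249211a62cd6f"),
    ("43422047c78fa52b", "41ea825af0f262d9"),
    ("4d2d3be2823b0bf4", "27b693c626c2d4ef"),
    ("4f89df61392bb311", "4d2d3be2823b0bf4"),
]

if __name__ == "__main__":
    for kind, end_row, expected, found in check_prev_row_chain(ROWS):
        print(f"{kind} at {end_row}: expected prevRow {expected}, found {found}")
```

On the rows above this flags two entries.  Tablet 43422047c78fa52b points back 
to 41ea825af0f262d9, for which no tablet is listed (the hole), and tablet 
4d2d3be2823b0bf4 still carries the pre-split prevRow 27b693c626c2d4ef, so it 
overlaps the tablet before it.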

                
> Hole in metadata table occurred during random walk test
> -------------------------------------------------------
>
>                 Key: ACCUMULO-315
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-315
>             Project: Accumulo
>          Issue Type: Bug
>          Components: master, tserver
>         Environment: Running 1.4.0 SNAPSHOT on 10 node cluster.
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>            Priority: Critical
>             Fix For: 1.4.0
>
>
> While running the random walk test a hole in the metadata table occurred.  A 
> client tried to delete the table with the hole and the fate op got stuck.  
> Was continually seeing the following in the master logs.
> {noformat}
> 14 00:02:11,273 [tableOps.CleanUp] DEBUG: Still waiting for table to be deleted: 4ct locationState: 4ct;4d2d3be2823b0bf4;27b693c626c2d4ef@(null,xxx.xxx.xxx.xxx:9997[134d7425fc503e1],null)
> {noformat}
> The metadata table contained the following.  Tablet 4ct;4d2d3be2823b0bf4 had 
> a location.
> {noformat}
> 4ct;262249211a62cd6f ~tab:~pr []    \x011819e56edae21302
> 4ct;27b693c626c2d4ef ~tab:~pr []    \x01262249211a62cd6f
> 4ct;43422047c78fa52b ~tab:~pr []    \x0141ea825af0f262d9
> 4ct;4d2d3be2823b0bf4 ~tab:~pr []    \x0127b693c626c2d4ef
> 4ct;4f89df61392bb311 ~tab:~pr []    \x014d2d3be2823b0bf4
> {noformat}
> Found the following events on a tablet server.
> {noformat}
> 21:36:04,369 [tabletserver.Tablet] TABLET_HIST: 4ct;4d2d3be2823b0bf4;27b693c626c2d4ef split 4ct;41ea825af0f262d9;27b693c626c2d4ef 4ct;4d2d3be2823b0bf4;41ea825af0f262d9
> 21:36:06,351 [tabletserver.Tablet] TABLET_HIST: 4ct;4d2d3be2823b0bf4;41ea825af0f262d9 split 4ct;43422047c78fa52b;41ea825af0f262d9 4ct;4d2d3be2823b0bf4;43422047c78fa52b
> {noformat}
> Saw the following on the tablet server serving the metadata tablet at around 
> the time of the splits.  Not sure if this is related.
> {noformat}
> 13 21:36:10,956 [server.TNonblockingServer] WARN : Got an IOException in internalRead!
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcher.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:171)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
>         at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
>         at org.apache.thrift.server.TNonblockingServer$FrameBuffer.internalRead(TNonblockingServer.java:668)
>         at org.apache.thrift.server.TNonblockingServer$FrameBuffer.read(TNonblockingServer.java:457)
>         at org.apache.thrift.server.TNonblockingServer$SelectThread.handleRead(TNonblockingServer.java:358)
>         at org.apache.thrift.server.TNonblockingServer$SelectThread.select(TNonblockingServer.java:303)
>         at org.apache.thrift.server.TNonblockingServer$SelectThread.run(TNonblockingServer.java:242)
> {noformat}
> Not sure what caused the metadata problem.  Further investigation is needed.  
> Also, while debugging, the master started assigning and unassigning metadata 
> tablets rapidly.  Did not get a chance to investigate this; it stopped when I 
> stopped the random walk test.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
