Bulk random walk test failed
----------------------------
Key: ACCUMULO-334
URL: https://issues.apache.org/jira/browse/ACCUMULO-334
Project: Accumulo
Issue Type: Bug
Components: test
Environment: Running random walk test against 1.4.0-SNAPSHOT on 10 node cluster
Reporter: Keith Turner
Assignee: John Vines
Fix For: 1.4.0
The bulk random walk test failed while running on a 10 node cluster w/ the
following error message.
{noformat}
18 23:36:05,167 [bulk.Setup] INFO : Starting bulk test on 459a04a0
19 00:24:33,950 [randomwalk.Framework] ERROR: Error during random walk
java.lang.Exception: Error running node Bulk.xml
    at org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:253)
    at org.apache.accumulo.server.test.randomwalk.Framework.run(Framework.java:61)
    at org.apache.accumulo.server.test.randomwalk.Framework.main(Framework.java:114)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.accumulo.start.Main$1.run(Main.java:89)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.Exception: Error running node bulk.Verify
    at org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:253)
    at org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:249)
    ... 8 more
Caused by: java.lang.Exception: Bad key at r0d646 cf:000 [] 1326932285943 false -1
    at org.apache.accumulo.server.test.randomwalk.bulk.Verify.visit(Verify.java:51)
    at org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:249)
    ... 9 more
{noformat}
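The "Bad key ... -1" above comes from the bulk test's verification step. To reproduce the table analysis described below, a simple scan over the test table looking for runs of -1 marker values is enough. The following is only a rough sketch, assuming the marker value is stored as a decimal string in the cell value; the instance, zookeeper, credential, and table names are placeholders, not values from this run.
{noformat}
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class FindNegativeMarkerRanges {
  public static void main(String[] args) throws Exception {
    // Placeholder connection info; substitute the instance and table used by the failed run.
    Connector conn = new ZooKeeperInstance("testInstance", "zkhost:2181")
        .getConnector("root", "secret".getBytes());
    Scanner scanner = conn.createScanner("bulk_459a04a0", new Authorizations());

    String rangeStart = null;
    String prevRow = null;
    for (Entry<Key,Value> entry : scanner) {
      String row = entry.getKey().getRow().toString();
      long marker = Long.parseLong(entry.getValue().toString());
      if (marker < 0 && rangeStart == null) {
        rangeStart = row;                       // entering a run of -1 values
      } else if (marker >= 0 && rangeStart != null) {
        System.out.println("-1 values in [" + rangeStart + ", " + prevRow + "]");
        rangeStart = null;                      // run ended at the previous row
      }
      prevRow = row;
    }
    if (rangeStart != null)
      System.out.println("-1 values in [" + rangeStart + ", " + prevRow + "]");
  }
}
{noformat}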
Looking at the table, the rows in [r0d646, r0edd9] and [r0f056, r10467] all had -1
values. There was a tablet, 268;r0edd9;r0d645, that exactly overlapped the first
range of -1 rows. This tablet had only the following activity on a tablet server
and was then merged out of existence. The merge operation covered the range
268;r10eff;r093b1.
{noformat}
19 00:05:10,966 [tabletserver.Tablet] DEBUG: Files for low split 268;r0edd9;r0d645 [/b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf, /t-0001ale/A0001an3.rf]
19 00:05:10,974 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 split 268;r0edd9;r0d645 268;r0f055;r0edd9
19 00:05:10,975 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 opened
19 00:05:15,029 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 import /b-0001azi/I0001azm.rf 17138 0
19 00:05:15,103 [tabletserver.Tablet] DEBUG: Starting MajC 268;r0edd9;r0d645 [/b-0001azi/I0001azm.rf, /b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf, /t-0001ale/A0001an3.rf] --> /t-0001apj/A0001bri.rf_tmp
19 00:05:15,339 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 import /b-0001azx/I0001azy.rf 16620 0
19 00:05:15,651 [tabletserver.Compactor] DEBUG: Compaction 268;r0edd9;r0d645 181,080 read | 60,360 written | 553,761 entries/sec | 0.327 secs
19 00:05:15,661 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 MajC [/b-0001azi/I0001azm.rf, /b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf, /t-0001ale/A0001an3.rf] --> /t-0001apj/A0001bri.rf
19 00:05:30,672 [tabletserver.Tablet] DEBUG: Starting MajC 268;r0edd9;r0d645 [/b-0001azx/I0001azy.rf] --> /t-0001apj/C0001brn.rf_tmp
19 00:05:30,810 [tabletserver.Compactor] DEBUG: Compaction 268;r0edd9;r0d645 60,360 read | 60,360 written | 534,159 entries/sec | 0.113 secs
19 00:05:30,824 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 MajC [/b-0001azx/I0001azy.rf] --> /t-0001apj/C0001brn.rf
19 00:05:30,943 [tabletserver.Tablet] DEBUG: initiateClose(saveState=true queueMinC=false disableWrites=false) 268;r0edd9;r0d645
19 00:05:30,943 [tabletserver.Tablet] DEBUG: completeClose(saveState=true completeClose=true) 268;r0edd9;r0d645
19 00:05:30,947 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 closed
19 00:05:30,947 [tabletserver.TabletServer] DEBUG: Unassigning 268;r0edd9;r0d645@(null,xxx.xxx.xxx.xxx:9997[134d7425fc59413],null)
19 00:05:30,949 [tabletserver.TabletServer] INFO : unloaded 268;r0edd9;r0d645
19 00:05:30,949 [tabletserver.TabletServer] INFO : unloaded 268;r0edd9;r0d645
{noformat}
For the second range of -1 values, [r0f056, r10467], r0f056 corresponds to the
split point r0f055. However, there is no split point corresponding to r10467.
All of the tablets w/ a split point of r0f055 lived on one tablet server.
{noformat}
19 00:02:21,262 [tabletserver.Tablet] TABLET_HIST: 268<;r0d645 split 268;r0f055;r0d645 268<;r0f055
19 00:02:21,263 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 opened
19 00:02:21,264 [tabletserver.Tablet] TABLET_HIST: 268<;r0f055 opened
19 00:02:44,504 [tabletserver.Tablet] TABLET_HIST: 268<;r0f055 split 268;r11da6;r0f055 268<;r11da6
19 00:02:44,505 [tabletserver.Tablet] TABLET_HIST: 268;r11da6;r0f055 opened
19 00:05:10,974 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 split 268;r0edd9;r0d645 268;r0f055;r0edd9
19 00:05:10,975 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0edd9 opened
19 00:05:15,023 [tabletserver.Tablet] TABLET_HIST: 268;r11da6;r0f055 split 268;r0f622;r0f055 268;r11da6;r0f622
19 00:05:15,024 [tabletserver.Tablet] TABLET_HIST: 268;r0f622;r0f055 opened
{noformat}
All of the tablets mentioned so far were merged away in the same merge
operation, making that operation a possible place where data loss occurred.
However, I cannot pinpoint the issue at this point in time. Below is some
info from the master logs showing which tablets were involved in the merge.
{noformat}
19 00:05:30,616 [master.EventCoordinator] INFO : Merge state of 268;r10eff;r093b1 set to WAITING_FOR_CHOPPED
19 00:05:30,677 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc5940c] to chop 268;r09927;r0903a
19 00:05:30,678 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc5940c] to chop 268;r0ca9e;r09927
19 00:05:30,678 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc5940a] to chop 268;r0d2b5;r0ca9e
19 00:05:30,678 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc59412] to chop 268;r0d645;r0d2b5
19 00:05:30,678 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0edd9;r0d645
19 00:05:30,678 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f055;r0edd9
19 00:05:30,678 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f622;r0f055
19 00:05:30,678 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f68b;r0f622
19 00:05:30,678 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r10c14;r0f68b
19 00:05:30,678 [master.Master] INFO : Asking xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r110f7;r10c14
{noformat}
When this test verifies its data and detects data loss, there is no easy way to
determine when the data loss occurred. It might be useful to modify the data
written by the bulk test so that it is easier to determine the time when data
was lost. For example, the continuous ingest test creates a linked list, and it
is possible to determine tight time bounds on when a node was ingested.
However, that may change the nature of this test and the bugs it might find.
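As one illustration of that idea (a sketch only; the encoding and helper names below are hypothetical and not part of the existing bulk test), the ingest wall-clock time could be appended to each value the test writes, so that verification can report roughly when a lost or corrupted cell was originally written:
{noformat}
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class TimestampedBulkValue {
  // Hypothetical encoding "<markerCount>:<ingestTimeMillis>"; the existing test
  // stores only the marker count in the value.
  static Value encode(long markerCount) {
    return new Value((markerCount + ":" + System.currentTimeMillis()).getBytes());
  }

  static Mutation markerMutation(String row, String cf, String cq, long markerCount) {
    Mutation m = new Mutation(new Text(row));
    m.put(new Text(cf), new Text(cq), encode(markerCount));
    return m;
  }

  // On the verify side, the encoded time of a flagged entry (or of its surviving
  // neighbors) brackets when the affected range was ingested.
  static long decodeIngestTime(Value v) {
    String s = v.toString();
    return Long.parseLong(s.substring(s.indexOf(':') + 1));
  }
}
{noformat}
Since this test loads data through bulk import rather than a live writer, the same encoding would have to be applied when the import files are generated; the only point is that the ingest time travels with the value.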