[ 
https://issues.apache.org/jira/browse/ACCUMULO-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Turner resolved ACCUMULO-314.
-----------------------------------

    Resolution: Fixed
    
> Re-queue tablets immediately after major compaction if there is more work 
> --------------------------------------------------------------------------
>
>                 Key: ACCUMULO-314
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-314
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.3.5
>         Environment: 1.4.0-SNAPSHOT on 10 node cluster
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>             Fix For: 1.4.0
>
>
> While running the random walk test, I noticed the shard test was running 
> slowly sometimes thanks to ACCUMULO-273.
> {noformat}
> 13 19:56:47,284 [shard.Merge] DEBUG: merging ST_index_6389_1326478898465
> 13 19:56:52,543 [shard.Insert] DEBUG: Inserted document ac64000000000000
> 13 19:56:54,016 [shard.Commit] DEBUG: Committed inserts
> 13 19:56:54,019 [shard.Insert] DEBUG: Inserted document bc64000000000000
> 13 19:56:54,020 [shard.Insert] DEBUG: Inserted document cc64000000000000
> 13 19:56:54,021 [shard.Insert] DEBUG: Inserted document dc64000000000000
> 13 19:56:54,022 [shard.Insert] DEBUG: Inserted document ec64000000000000
> 13 19:56:54,023 [shard.Insert] DEBUG: Inserted document fc64000000000000
> 13 19:56:54,025 [shard.Insert] DEBUG: Inserted document 0d64000000000000
> 13 19:56:54,026 [shard.Insert] DEBUG: Inserted document 1d64000000000000
> 13 19:56:54,055 [shard.Commit] DEBUG: Committed inserts
> 13 19:56:54,068 [shard.Search] DEBUG: Looking up terms [154l, 1kzi] expect to 
> find 9ee0000000000000
> 13 20:01:54,102 [randomwalk.Module] WARN : Node 
> org.apache.accumulo.server.test.randomwalk.shard.Search has been running for 
> 300.0 seconds. You may want to look into it.
> 13 20:05:52,530 [randomwalk.Module] WARN : Node 
> org.apache.accumulo.server.test.randomwalk.shard.Search, which was running 
> long, has now completed after 538.475 seconds
> {noformat}
> I noticed a merge usually preceded the slow lookups.  I looked at the master 
> logs and saw that the merge finished ok and saw which tablet server the 
> merged tablet was assigned to. Below are some snippets from the master log 
> that show the table id and tablet server.
> {noformat}
> 13 18:36:43,236 [tableOps.RenameTable] DEBUG: Renamed table 1bk 
> ST_index_6389_1326478898465_tmp ST_index_6389_1326478898465
> 13 19:56:47,293 [tableOps.Utils] INFO : table 1bk (3b08cf01ba49883) locked 
> for write operation: MERGE
> 13 19:56:52,496 [tableOps.Utils] INFO : table 1bk (3b08cf01ba49883) unlocked 
> for write
> 13 19:56:52,504 [master.Master] DEBUG: Normal Tablets assigning tablet 
> 1bk<<=xxx.xxx.xxx.xxx:9997[134d7425fc503db]
> {noformat}
> Some snippets from the tablet server logs are below; they show the 
> problem.
> {noformat}
> 13 19:56:52,522 [tabletserver.Tablet] TABLET_HIST: 1bk<< opened
> 13 19:56:54,065 [tabletserver.Tablet] WARN : Tablet 1bk<< has too many files, 
> batch lookup can not run
> 13 19:57:10,383 [tabletserver.Compactor] DEBUG: Compaction 1bk<< 6,954 read | 
> 6,954 written | 108,656 entries/sec |  0.064 secs
> 13 19:57:10,402 [tabletserver.Tablet] TABLET_HIST: 1bk<< MajC 
> [/t-0000qzs/C0000sj3.rf, /t-0000qzt/F0000rtf.rf, /t-0000qzt/F0000s0r.rf, 
> /t-0000qzz/F0000sc0.rf, /t-0000r00/F0000s0v.rf, /t-0000r0f/C0000rpu.rf, 
> /t-0000r0l/C0000qqz.rf, /t-0000rqt/C0000s3m.rf, /t-0000rra/C0000sbx.rf, 
> /t-0000rrh/F0000sgj.rf] --> /c-00000054/C0000soe.rf
> 13 19:57:40,534 [tabletserver.Compactor] DEBUG: Compaction 1bk<< 21,036 read 
> | 21,036 written | 104,656 entries/sec |  0.201 secs
> 13 19:57:40,564 [tabletserver.Tablet] TABLET_HIST: 1bk<< MajC 
> [/t-0000qzm/C0000rfa.rf, /t-0000r0l/F0000sc6.rf, /t-0000rr1/C0000rpr.rf, 
> /t-0000rr4/F0000sc2.rf, /t-0000rr5/F0000rq0.rf, /t-0000rr9/F0000s0y.rf, 
> /t-0000rrb/F0000sc5.rf, /t-0000rrs/F0000sc7.rf, /t-0000rs1/F0000ssf.rf, 
> /t-0000rs2/F0000ssg.rf] --> /c-00000054/C0000son.rf
> {noformat}
> The problem is that the merged tablet has too many files to open, so the 
> batch scan for the shard test can not run.  However it takes the tablet 
> server forever to work this issue out.  Every 30 seconds it compacts 10 tablet 
> files down to one.  The compactions take a few hundred milliseconds, so it 
> could be worked out much faster if the compactions occurred back to back.
> In 1.3 compactions were changed from depth first to breadth first (e.g. if a 
> tablet server has 100 tablets and all have 100 files, then instead of 
> compacting each tablet down to one file in turn, it goes across the tablets 
> compacting 10 files at a time until each tablet has one file).  This change 
> introduced this bug.  There is no need to wait 30 seconds between 
> compactions in this case.
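
The idea in the issue title can be sketched roughly as follows. This is a toy simulation, not the actual tserver change: the class and method names (RequeueSketch, Tablet, drain), the 200 ms compaction time, and the file threshold are all assumptions made for illustration. Only the 30 second interval and the 10-files-to-1 compaction come from the description above. Re-queued tablets go to the tail of the queue, so the breadth-first order is kept.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

class RequeueSketch {
    static final int FILES_PER_COMPACTION = 10;   // from the description: 10 files -> 1
    static final long PASS_INTERVAL_MS = 30_000;  // from the description: 30 s between passes
    static final long COMPACTION_MS = 200;        // assumption, roughly what the logs show

    /** Hypothetical stand-in for a tablet; it only tracks a file count. */
    static final class Tablet {
        final String extent;
        int files;

        Tablet(String extent, int files) {
            this.extent = extent;
            this.files = files;
        }

        boolean needsMajorCompaction(int maxFiles) {
            return files > maxFiles;
        }

        /** Merge up to FILES_PER_COMPACTION files into a single new file. */
        void majorCompact() {
            int merged = Math.min(files, FILES_PER_COMPACTION);
            files = files - merged + 1;
        }
    }

    /**
     * Returns the simulated time in ms until no tablet has more than maxFiles
     * files. Nothing actually sleeps; the clock is just a counter.
     */
    static long drain(List<Tablet> tablets, int maxFiles, boolean requeueImmediately) {
        if (maxFiles < 1) {
            throw new IllegalArgumentException("maxFiles must be at least 1");
        }
        long clockMs = 0;
        Queue<Tablet> queue = new ArrayDeque<>();
        while (true) {
            // Periodic pass: queue every tablet that has too many files.
            for (Tablet t : tablets) {
                if (t.needsMajorCompaction(maxFiles)) {
                    queue.add(t);
                }
            }
            while (!queue.isEmpty()) {
                Tablet t = queue.poll();
                t.majorCompact();
                clockMs += COMPACTION_MS;
                // The proposed behavior: if there is more work, put the tablet
                // back at the tail right away instead of waiting for the next pass.
                if (requeueImmediately && t.needsMajorCompaction(maxFiles)) {
                    queue.add(t);
                }
            }
            if (tablets.stream().noneMatch(t -> t.needsMajorCompaction(maxFiles))) {
                return clockMs;
            }
            clockMs += PASS_INTERVAL_MS; // idle until the next periodic pass
        }
    }

    public static void main(String[] args) {
        int maxFiles = 10; // hypothetical threshold for "too many files"
        long slow = drain(List.of(new Tablet("1bk<<", 100)), maxFiles, false);
        long fast = drain(List.of(new Tablet("1bk<<", 100)), maxFiles, true);
        System.out.println("wait for next pass: " + slow + " ms");
        System.out.println("re-queue at once:   " + fast + " ms");
    }
}
```

Under these assumptions a single tablet with 100 files needs ten compactions to get down to 10 files. Waiting for the next pass each time costs 272,000 simulated ms, almost all of it idle; re-queueing immediately costs 2,000 ms.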
