[
https://issues.apache.org/jira/browse/ACCUMULO-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Keith Turner resolved ACCUMULO-314.
-----------------------------------
Resolution: Fixed
> Re-queue tablets immediately after major compaction if there is more work
> --------------------------------------------------------------------------
>
> Key: ACCUMULO-314
> URL: https://issues.apache.org/jira/browse/ACCUMULO-314
> Project: Accumulo
> Issue Type: Bug
> Components: tserver
> Affects Versions: 1.3.5
> Environment: 1.4.0-SNAPSHOT on 10 node cluster
> Reporter: Keith Turner
> Assignee: Keith Turner
> Fix For: 1.4.0
>
>
> While running the random walk test, I noticed the shard test was running
> slowly sometimes thanks to ACCUMULO-273.
> {noformat}
> 13 19:56:47,284 [shard.Merge] DEBUG: merging ST_index_6389_1326478898465
> 13 19:56:52,543 [shard.Insert] DEBUG: Inserted document ac64000000000000
> 13 19:56:54,016 [shard.Commit] DEBUG: Committed inserts
> 13 19:56:54,019 [shard.Insert] DEBUG: Inserted document bc64000000000000
> 13 19:56:54,020 [shard.Insert] DEBUG: Inserted document cc64000000000000
> 13 19:56:54,021 [shard.Insert] DEBUG: Inserted document dc64000000000000
> 13 19:56:54,022 [shard.Insert] DEBUG: Inserted document ec64000000000000
> 13 19:56:54,023 [shard.Insert] DEBUG: Inserted document fc64000000000000
> 13 19:56:54,025 [shard.Insert] DEBUG: Inserted document 0d64000000000000
> 13 19:56:54,026 [shard.Insert] DEBUG: Inserted document 1d64000000000000
> 13 19:56:54,055 [shard.Commit] DEBUG: Committed inserts
> 13 19:56:54,068 [shard.Search] DEBUG: Looking up terms [154l, 1kzi] expect to
> find 9ee0000000000000
> 13 20:01:54,102 [randomwalk.Module] WARN : Node
> org.apache.accumulo.server.test.randomwalk.shard.Search has been running for
> 300.0 seconds. You may want to look into it.
> 13 20:05:52,530 [randomwalk.Module] WARN : Node
> org.apache.accumulo.server.test.randomwalk.shard.Search, which was running
> long, has now completed after 538.475 seconds
> {noformat}
> I noticed a merge usually preceded the slow lookups. I looked the the master
> logs and saw that the merge finished ok and saw which tablet server the
> merged tablet was assigned to. Below are some snippets from the master log
> that show the table id and tablet server.
> {noformat}
> 13 18:36:43,236 [tableOps.RenameTable] DEBUG: Renamed table 1bk
> ST_index_6389_1326478898465_tmp ST_index_6389_1326478898465
> 13 19:56:47,293 [tableOps.Utils] INFO : table 1bk (3b08cf01ba49883) locked
> for write operation: MERGE
> 13 19:56:52,496 [tableOps.Utils] INFO : table 1bk (3b08cf01ba49883) unlocked
> for write
> 13 19:56:52,504 [master.Master] DEBUG: Normal Tablets assigning tablet
> 1bk<<=xxx.xxx.xxx.xxx:9997[134d7425fc503db]
> {noformat}
> Some snippets from the tablet server logs are below and this shows the
> problem.
> {noformat}
> 13 19:56:52,522 [tabletserver.Tablet] TABLET_HIST: 1bk<< opened
> 13 19:56:54,065 [tabletserver.Tablet] WARN : Tablet 1bk<< has too many files,
> batch lookup can not run
> 13 19:57:10,383 [tabletserver.Compactor] DEBUG: Compaction 1bk<< 6,954 read |
> 6,954 written | 108,656 entries/sec | 0.064 secs
> 13 19:57:10,402 [tabletserver.Tablet] TABLET_HIST: 1bk<< MajC
> [/t-0000qzs/C0000sj3.rf, /t-0000qzt/F0000rtf.rf, /t-0000qzt/F0000s0r.rf,
> /t-0000qzz/F0000sc0.rf, /t-0000r00/F0000s0v.rf, /t-0000r0f/C0000rpu.rf,
> /t-0000r0l/C0000qqz.rf, /t-0000rqt/C0000s3m.rf, /t-0000rra/C0000sbx.rf,
> /t-0000rrh/F0000sgj.rf] --> /c-00000054/C0000soe.rf
> 13 19:57:40,534 [tabletserver.Compactor] DEBUG: Compaction 1bk<< 21,036 read
> | 21,036 written | 104,656 entries/sec | 0.201 secs
> 13 19:57:40,564 [tabletserver.Tablet] TABLET_HIST: 1bk<< MajC
> [/t-0000qzm/C0000rfa.rf, /t-0000r0l/F0000sc6.rf, /t-0000rr1/C0000rpr.rf,
> /t-0000rr4/F0000sc2.rf, /t-0000rr5/F0000rq0.rf, /t-0000rr9/F0000s0y.rf,
> /t-0000rrb/F0000sc5.rf, /t-0000rrs/F0000sc7.rf, /t-0000rs1/F0000ssf.rf,
> /t-0000rs2/F0000ssg.rf] --> /c-00000054/C0000son.rf
> {noformat}
> The problem is that the merged tablet has too many files to open, so the
> batch scan for the shard test can not run. However it takes the tablet
> server forver to work this issue out. Every 30 seconds it compacts 10 tablet
> files down to one. The compactions take a few hundred milliseconds, so it
> could be worked out much faster if the compactions occurred back to back.
> In 1.3 compactions were changed from depth first to breadth first (e.g. if a
> tablet server has 100 tablets and all have 100 files, instead of compacting
> each tablet to one file go across the tablets compacting 10 at a time until
> each tablet has one file). This change introduced this bug. There is no
> need to wait 30 seconds between compactions in this case.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira