[
https://issues.apache.org/jira/browse/HBASE-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeff Whiting updated HBASE-2646:
--------------------------------
Attachment: 2464-fix-race-condition-r1004349.txt
As per my discussion with Ted Yu, here is a patch that addresses the potential
race condition with queue.remove(queuedRequest). It does a sanity check to
make sure that what I'm removing from regionsInQueue is what I expect;
otherwise I leave it in place. The patch treats this.queue as the authoritative
source on which requests to run and will always return whatever was removed
from the queue.
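Roughly, the idea looks like this (a minimal sketch with made-up class and
field names, not the attached patch itself): the queue is authoritative, and
the companion regionsInQueue map is only cleaned up when it still points at the
exact request that was dequeued.
{noformat}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.PriorityBlockingQueue;

class CompactionQueueSketch {
  // Hypothetical request type: one queued compaction for one region.
  static class CompactionRequest implements Comparable<CompactionRequest> {
    final String regionName;
    final int priority;  // lower value = more urgent
    CompactionRequest(String regionName, int priority) {
      this.regionName = regionName;
      this.priority = priority;
    }
    @Override public int compareTo(CompactionRequest other) {
      return Integer.compare(this.priority, other.priority);
    }
  }

  private final PriorityBlockingQueue<CompactionRequest> queue =
      new PriorityBlockingQueue<CompactionRequest>();
  private final Map<String, CompactionRequest> regionsInQueue =
      new HashMap<String, CompactionRequest>();

  /** Queue a request; if the region is already queued at a lower priority,
   *  swap in the more urgent request. queue.remove(existing) can race with
   *  take(), which is why take() sanity-checks the map before cleaning it up. */
  synchronized void add(CompactionRequest request) {
    CompactionRequest existing = regionsInQueue.get(request.regionName);
    if (existing == null) {
      regionsInQueue.put(request.regionName, request);
      queue.add(request);
    } else if (request.priority < existing.priority) {
      queue.remove(existing);  // may already have been taken by the worker
      regionsInQueue.put(request.regionName, request);
      queue.add(request);
    }
    // Otherwise the region is already queued at equal or higher urgency; drop it.
  }

  /** The queue is authoritative: always return whatever came off of it. */
  CompactionRequest take() throws InterruptedException {
    CompactionRequest removed = queue.take();
    synchronized (this) {
      // Sanity check: only clear the regionsInQueue entry if it is the exact
      // request we just dequeued; otherwise leave it for its own dequeue.
      if (regionsInQueue.get(removed.regionName) == removed) {
        regionsInQueue.remove(removed.regionName);
      }
    }
    return removed;
  }
}
{noformat}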
> Compaction requests should be prioritized to prevent blocking
> -------------------------------------------------------------
>
> Key: HBASE-2646
> URL: https://issues.apache.org/jira/browse/HBASE-2646
> Project: HBase
> Issue Type: Improvement
> Components: regionserver
> Affects Versions: 0.20.4
> Environment: ubuntu server 10; hbase 0.20.4; 4 machine cluster (each
> machine is an 8 core xeon with 16 GB of ram and 6TB of storage); ~250 Million
> rows;
> Reporter: Jeff Whiting
> Priority: Critical
> Fix For: 0.20.7
>
> Attachments: 2464-fix-race-condition-r1004349.txt, 2646-v2.txt,
> 2646-v3.txt, prioritycompactionqueue-0.20.4.patch, PriorityQueue-r996664.patch
>
>
> While testing the write capacity of a 4 machine hbase cluster we were getting
> long and frequent client pauses as we attempted to load the data. Looking
> into the problem we'd get a relatively large compaction queue, and when a
> region hit the "hbase.hstore.blockingStoreFiles" limit it would block the
> client and the compaction request would get put at the back of the queue,
> waiting behind many other less important compactions. The client is basically
> stuck at that point until a compaction is done. Prioritizing the compaction
> requests and allowing the request that is blocking other actions to go first
> would help solve the problem.
> You can see the problem by looking at our log files:
> You'll first see an event such as a "too many hlogs" message, which will put
> a lot of requests on the compaction queue.
> {noformat}
> 2010-05-25 10:53:26,570 INFO org.apache.hadoop.hbase.regionserver.HLog: Too
> many hlogs: logs=33, maxlogs=32; forcing flush of 22 regions(s):
> responseCounts,RS_6eZzLtdwhGiTwHy,1274232223324,
> responses,RS_0qhkL5rUmPCbx3K-1274213057242,1274513189592,
> responses,RS_1ANYnTegjzVIsHW-1274217741921,1274511001873,
> responses,RS_1HQ4UG5BdOlAyuE-1274216757425,1274726323747,
> responses,RS_1Y7SbqSTsZrYe7a-1274328697838,1274478031930,
> responses,RS_1ZH5TB5OdW4BVLm-1274216239894,1274538267659,
> responses,RS_3BHc4KyoM3q72Yc-1274290546987,1274502062319,
> responses,RS_3ra9BaBMAXFAvbK-1274214579958,1274381552543,
> responses,RS_6SDrGNuyyLd3oR6-1274219941155,1274385453586,
> responses,RS_8AGCEMWbI6mZuoQ-1274306857429,1274319602718,
> responses,RS_8C8T9DN47uwTG1S-1274215381765,1274289112817,
> responses,RS_8J5wmdmKmJXzK6g-1274299593861,1274494738952,
> responses,RS_8e5Sz0HeFPAdb6c-1274288641459,1274495868557,
> responses,RS_8rjcnmBXPKzI896-1274306981684,1274403047940,
> responses,RS_9FS3VedcyrF0KX2-1274245971331,1274754745013,
> responses,RS_9oZgPtxO31npv3C-1274214027769,1274396489756,
> responses,RS_a3FdO2jhqWuy37C-1274209228660,1274399508186,
> responses,RS_a3LJVxwTj29MHVa-12742
> {noformat}
> Then you see the result: a compaction is requested and the flush is put back
> at the end of the flush queue because the region has too many store files:
> {noformat}
> 2010-05-25 10:53:31,364 DEBUG
> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested
> for region
> responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862/783020138
> because: regionserver/192.168.0.81:60020.cacheFlusher
> 2010-05-25 10:53:32,364 WARN
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region
> responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862 has too many
> store files, putting it back at the end of the flush queue.
> {noformat}
> Which leads to this:
> {noformat}
> 2010-05-25 10:53:27,061 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Blocking updates for 'IPC Server handler 60 on 60020' on region
> responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862: memstore
> size 128.0m is >= than blocking 128.0m size
> 2010-05-25 10:53:27,061 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Blocking updates for 'IPC Server handler 84 on 60020' on region
> responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862: memstore
> size 128.0m is >= than blocking 128.0m size
> 2010-05-25 10:53:27,065 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Blocking updates for 'IPC Server handler 1 on 60020' on region
> responses-index,--1274799047787--R_cBKrGxx0FdWjPso,1274804575862: memstore
> size 128.0m is >= than blocking 128.0m size
> {noformat}
> Once the compaction / split is done, a flush is able to happen, which unblocks
> the IPC and allows writes to continue. Unfortunately this process can take
> upwards of 15 minutes (the specific case shown here from our logs took about
> 4 minutes).
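The prioritization the report asks for can be sketched as ranking each request
by how close its store is to the "hbase.hstore.blockingStoreFiles" limit, so a
request for a region that is already blocking writes sorts first. This is only
an illustrative sketch with assumed names and an example config value, not the
attached patch:
{noformat}
/**
 * Sketch only: rank compaction requests so that a region that has reached
 * "hbase.hstore.blockingStoreFiles" (and is therefore blocking client writes)
 * sorts ahead of routine compactions.
 */
final class CompactionPriorityExample {
  /** Smaller value = more urgent; a value <= 0 means updates are already blocked. */
  static int priorityFor(int storeFileCount, int blockingStoreFiles) {
    return blockingStoreFiles - storeFileCount;
  }

  public static void main(String[] args) {
    int blockingStoreFiles = 7;  // example value for hbase.hstore.blockingStoreFiles
    System.out.println(priorityFor(3, blockingStoreFiles));  //  4: routine compaction
    System.out.println(priorityFor(9, blockingStoreFiles));  // -2: blocking writes, run first
  }
}
{noformat}
A request with a priority at or below zero would then jump ahead of routine
compactions in a priority queue like the one sketched in the comment above.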
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.