[ 
https://issues.apache.org/jira/browse/HBASE-22072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818898#comment-16818898
 ] 

ramkrishna.s.vasudevan commented on HBASE-22072:
------------------------------------------------

Created a patch that introduces a closeLock. I checked the code path where 
close(false) happens when the current scanner thread sees there is no data to 
retrieve. close(true) will in any case happen when the scan finishes the 
complete fetch of data, and that happens at the RegionScanner level. 
So it is the updateReaders() and the close(true) call that may have happened 
asynchronously, leading to the case that [~pKirillov] has mentioned here.
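The idea of the patch can be sketched as follows. This is a minimal, illustrative sketch, not HBase's actual StoreScanner code: the class, method, and field names (closeLock, pendingScanners, withDelayedScannersClose) are stand-ins, and the flushed-file scanners are modeled as plain strings. The point it shows is that both close() paths and a concurrent updateReaders() serialize on one lock, so a flush racing with a close cannot leave a scanner in the list unnoticed.

```java
import java.util.ArrayList;
import java.util.List;

class SketchStoreScanner {
    private final Object closeLock = new Object();            // hypothetical lock added by the patch
    private final List<String> flushedStoreFileScanners = new ArrayList<>();
    private boolean closed = false;

    // Called by the flusher thread after a memstore flush.
    void updateReaders(List<String> newScanners) {
        synchronized (closeLock) {
            if (closed) {
                return;                                       // scanner already fully closed
            }
            flushedStoreFileScanners.addAll(newScanners);
        }
    }

    // false: the current scan thread saw no more data (partial close);
    // true: the RegionScanner finished the complete fetch (final close).
    void close(boolean withDelayedScannersClose) {
        synchronized (closeLock) {
            if (withDelayedScannersClose) {
                flushedStoreFileScanners.clear();             // release the delayed scanners
                closed = true;
            }
        }
    }

    int pendingScanners() {
        synchronized (closeLock) {
            return flushedStoreFileScanners.size();
        }
    }
}
```

Under this scheme a close(false) racing with updateReaders() leaves the delayed scanners for the later close(true) to release, rather than dropping them.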
bq. Notice flushedstoreFileScanners is an ArrayList, neither volatile nor a 
thread-safe one. Rarely, a thread that closes StoreScanner right after the 
flusher thread executed StoreScanner.updateReaders may not see changes in the 
flushedstoreFileScanners list and keeps an unclosed scanner.
I am not sure about this. Declaring flushedstoreFileScanners as volatile only 
makes the reference volatile, not the contents of the list. Since in this 
patch we guard it with a lock, I think the thread doing close() and the 
thread doing updateReaders() should anyway see the updated contents of the 
flushedstoreFileScanners list.
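The visibility argument above can be sketched with a toy example (illustrative only, not HBase code; the class and field names are made up). Because both threads synchronize on the same monitor, the unlock at the end of the writer's synchronized block happens-before the subsequent lock acquisition by the reader (JLS §17.4.5), so the reader observes the list contents even though neither the reference nor the elements are volatile.

```java
import java.util.ArrayList;
import java.util.List;

class VisibilitySketch {
    private final Object lock = new Object();                 // shared monitor, stands in for closeLock
    private final List<String> scanners = new ArrayList<>();  // deliberately NOT volatile

    // Flusher-thread side: add under the lock.
    void add(String s) {
        synchronized (lock) {
            scanners.add(s);
        }
    }

    // Closing-thread side: read under the same lock.
    int size() {
        synchronized (lock) {
            return scanners.size();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        VisibilitySketch v = new VisibilitySketch();
        Thread flusher = new Thread(() -> v.add("flushed-scanner-1"));
        flusher.start();
        flusher.join();
        // The monitor unlock in add() happens-before the lock in size(),
        // so this thread sees the element without any volatile field.
        System.out.println(v.size());
    }
}
```

So the lock alone is enough for visibility; volatile on the field would only cover the reference, not the ArrayList's internal state.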
[~pKirillov]
Could you review this patch and give your comments? If you feel it is good, 
could you try it in your cluster to see if the problem you described happens 
again?

> High read/write intensive regions may cause long crash recovery
> ---------------------------------------------------------------
>
>                 Key: HBASE-22072
>                 URL: https://issues.apache.org/jira/browse/HBASE-22072
>             Project: HBase
>          Issue Type: Bug
>          Components: Performance, Recovery
>    Affects Versions: 2.1.2
>            Reporter: Pavel
>            Priority: Major
>         Attachments: HBASE-22072.HBASE-21879-v1.patch
>
>
> Compaction of a region under high read load may leave compacted files 
> undeleted because of existing scan references:
> INFO org.apache.hadoop.hbase.regionserver.HStore - Can't archive compacted 
> file hdfs://hdfs-ha/hbase... because of either isCompactedAway=true or file 
> has reference, isReferencedInReads=true, refCount=1, skipping for now
> If the region is also under high write load this happens quite often, and 
> the region may have few storefiles but tons of undeleted compacted hdfs 
> files.
> The region keeps all those files (in my case thousands) until the graceful 
> region closing procedure, which ignores existing references and drops 
> obsolete files. This works fine, apart from consuming some extra hdfs space, 
> but only in the case of normal region closing. If the region server crashes, 
> then the new region server responsible for that overfilled region reads the 
> hdfs folder and tries to deal with all the undeleted files, producing tons 
> of storefiles and compaction tasks and consuming an abnormal amount of 
> memory, which may lead to an OutOfMemory exception and further region server 
> crashes. This stops writes to the region because the number of storefiles 
> reaches the *hbase.hstore.blockingStoreFiles* limit, forces high GC duty, 
> and may take hours to compact all the files into a working set of files.
> A workaround is to periodically check the hdfs folder file counts and force 
> region assignment for the ones with too many files.
> It would be nice if the regionserver had a setting similar to 
> hbase.hstore.blockingStoreFiles that triggers an attempt to drop undeleted 
> compacted files when the number of files reaches this setting.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
