[ 
https://issues.apache.org/jira/browse/HBASE-27826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825746#comment-17825746
 ] 

Prathyusha commented on HBASE-27826:
------------------------------------

[~zhangduo] To give a brief summary: we recently ran a couple of experiments 
(without much load) to quantify the benefit of this change, and we found it 
should reduce total split time by at least 3-5 seconds. 

At a high level our current plan is to

1. Keep the question of whether a Reference is an actual physical file 
abstracted at the SFT (StoreFileTracker) layer. DefaultStoreFileTracker creates 
a physical file, while FileBasedStoreFileTracker updates the manifest (f1/f2) 
file.

2. For the above, we want to move the creation of and access to Ref/link files 
into the StoreFileTracker interface. Currently we use the FS object everywhere 
to create these Reference files; we now want this to go via the SFT interface. 
Also, in many places we do a list operation on storeDir using the Hadoop 
FileSystem object to get all store files; this also needs to go via SFT. 
(We have started on this refactor first.)

3. Once the above is done, add the implementation in FileBasedStoreFileTracker 
to update the manifest file.

4. After that, similar to what you mentioned above, refactor the [splitStoreFiles 
code|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/SplitTableRegionProcedure.java#L657]
 to build the expectedReferences metadata in memory, so that 
DefaultStoreFileTracker creates the physical files and FileBasedStoreFileTracker 
updates the manifest file in one go. 
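
To make the plan concrete, here is a minimal Java sketch of the idea, not the 
actual HBase API. ReferenceTracker, PhysicalFileTracker, ManifestTracker, 
commitReferences and the reference name format are all hypothetical names 
invented for illustration; real file and manifest I/O is replaced by counters 
so the example runs standalone. It only shows how one interface call can hide 
whether references become physical files (one write each) or entries in a 
single manifest update.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names are the real StoreFileTracker API.
final class SplitReferenceSketch {

  /** Hypothetical slice of a tracker interface that owns reference creation. */
  interface ReferenceTracker {
    /** Records all references for a daughter region in one call. */
    void commitReferences(List<String> referenceNames);

    /** Number of simulated storage write operations issued so far. */
    int storageWrites();

    /** Reference names currently visible through the tracker. */
    List<String> references();
  }

  /** Stand-in for the default tracker: one physical file per reference. */
  static final class PhysicalFileTracker implements ReferenceTracker {
    private final List<String> files = new ArrayList<>();
    private int writes;

    @Override
    public void commitReferences(List<String> referenceNames) {
      for (String name : referenceNames) {
        files.add(name); // simulates creating one file on the filesystem
        writes++;
      }
    }

    @Override
    public int storageWrites() {
      return writes;
    }

    @Override
    public List<String> references() {
      return new ArrayList<>(files);
    }
  }

  /** Stand-in for the file based tracker: references exist only in the manifest. */
  static final class ManifestTracker implements ReferenceTracker {
    private final List<String> manifest = new ArrayList<>();
    private int writes;

    @Override
    public void commitReferences(List<String> referenceNames) {
      manifest.addAll(referenceNames);
      writes++; // simulates a single rewrite of the manifest file
    }

    @Override
    public int storageWrites() {
      return writes;
    }

    @Override
    public List<String> references() {
      return new ArrayList<>(manifest);
    }
  }

  /** Builds the expected reference names in memory (illustrative name format). */
  static List<String> expectedReferences(List<String> parentStoreFiles, String parentRegion) {
    List<String> refs = new ArrayList<>();
    for (String hfile : parentStoreFiles) {
      refs.add(hfile + "." + parentRegion);
    }
    return refs;
  }

  public static void main(String[] args) {
    List<String> parentFiles = new ArrayList<>();
    for (int i = 0; i < 20; i++) {
      parentFiles.add("hfile" + i);
    }
    List<String> refs = expectedReferences(parentFiles, "parentRegion");

    ReferenceTracker physical = new PhysicalFileTracker();
    ReferenceTracker manifest = new ManifestTracker();
    physical.commitReferences(refs);
    manifest.commitReferences(refs);

    System.out.println("physical writes: " + physical.storageWrites());
    System.out.println("manifest writes: " + manifest.storageWrites());
  }
}
```

With the 20 store files used in main, the physical stand-in records 20 
simulated writes and the manifest stand-in records 1, while both expose the 
same list of references to the caller.
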

All of the above is open for discussion; looking forward to learning from and 
working with you :) 
cc [~apurtell] 

> Region split and merge time while offline is O(n) with respect to number of 
> store files
> ---------------------------------------------------------------------------------------
>
>                 Key: HBASE-27826
>                 URL: https://issues.apache.org/jira/browse/HBASE-27826
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.5.4
>            Reporter: Andrew Kyle Purtell
>            Priority: Major
>
> This is a significant availability issue when HFiles are on S3.
> HBASE-26079 ({_}Use StoreFileTracker when splitting and merging{_}) changed 
> the split and merge table procedure implementations to indirect through the 
> StoreFileTracker implementation when selecting HFiles to be merged or split, 
> rather than directly listing those using file system APIs. It also changed 
> the commit logic in HRegionFileSystem to add the link/ref files on resulting 
> split or merged regions to the StoreFileTracker. However, the creation of a 
> link file is still a filesystem operation and creating a “file” on S3 can 
> take well over a second. If, for example, there are 20 store files in a 
> region, which is not uncommon, after the region is taken offline for a split 
> (or merge) it may require more than 20 seconds to create the link files 
> before the results can be brought back online, creating a severe availability 
> problem. Splits and merges are supposed to be fast, completing in less than a 
> second, certainly less than a few seconds. This has been true when HFiles are 
> stored on HDFS only because file creation operations there are nearly 
> instantaneous. 
> There are two issues but both can be handled with modifications to the store 
> file tracker interface and the file based store file tracker implementation. 
> When the file based store file tracker is enabled, the HFile links should 
> be virtual entities that only exist in the file manifest. We do not require 
> physical files in the filesystem to serve as links now. That is the magic of 
> this file tracker: the manifest file replaces the requirement to list the 
> filesystem.
> Then, when splitting or merging, the HFile links should be collected into a 
> list and committed in one batch using a new FILE file tracker interface, 
> requiring only one update of the manifest file in S3, bringing the time 
> requirement for this operation down from O(n) to O(1).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
