[jira] [Comment Edited] (HBASE-27826) Region split and merge time while offline is O(n) with respect to number of store files

Andrew Kyle Purtell (Jira) Thu, 21 Mar 2024 23:41:04 -0700


    [ https://issues.apache.org/jira/browse/HBASE-27826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829754#comment-17829754 ]


Andrew Kyle Purtell edited comment on HBASE-27826 at 3/22/24 6:40 AM:
----------------------------------------------------------------------

{quote}Links and back references should also be encoded into the tracker file 
directly, but this is not fully covered by the design doc
{quote}
This issue really should cover link and ref files too, because it is the time 
required to create them all, when a region has a lot of store files, that 
causes splits to stay in the offline state for a very long time on S3. If they 
are still real files, S3 will take about a second to create each one. We 
could perhaps create them in parallel with a thread pool, but that is a 
workaround, not a solution. This is the main pain point.
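
For illustration only, the thread pool workaround mentioned above could look 
roughly like the sketch below. The class name and createLink are hypothetical 
placeholders, not HBase or S3 calls; createLink stands in for the slow 
per-file create.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch of the workaround; not taken from the HBase code base. */
final class ParallelLinkCreation {

  /** Placeholder for the slow per-file create call; here it only builds a name. */
  static String createLink(String storeFile) {
    return storeFile + ".link";
  }

  /** Submits one create task per store file and waits for all of them. */
  static List<String> createAll(List<String> storeFiles, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Callable<String>> tasks = new ArrayList<>();
      for (String f : storeFiles) {
        tasks.add(() -> createLink(f));
      }
      List<String> created = new ArrayList<>();
      // invokeAll blocks until every task is done and keeps the task order.
      for (Future<String> result : pool.invokeAll(tasks)) {
        created.add(result.get()); // a failed task surfaces as ExecutionException
      }
      return created;
    } finally {
      pool.shutdown();
    }
  }
}
```

Even with a pool, the number of create requests still grows with the number 
of store files; only the wall clock time is divided by the pool size, which 
is why this remains a workaround.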

Agreed, the solution for links and back references is not fully covered by the 
design doc yet, so we will update the design doc to add the missing coverage. 
There are some issues to work out, related to supporting rollback in particular. 
Migration will need to update both the new virtual entries in the manifest and 
the real "files" in the filesystem or bucket until the user decides they can 
fully switch over, or until we design a way for the switch to be safe and automatic.

bq. File a new issue, to add a version field in the tracker file definition, so 
when we find out that we are reading a tracker file with a higher version, we 
will fail so end users will know that they should do something before 
downgrading.

This one we can surely do now in a subtask, agreed.
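
A minimal sketch of what such a check might look like, assuming the tracker 
file carries an integer version. The class, constant, and method names are 
hypothetical and do not reflect the actual tracker file format.

```java
import java.io.IOException;

/** Hypothetical sketch; not HBase's actual tracker file format or API. */
final class TrackerFileVersionCheck {

  /** Highest tracker file version this hypothetical reader understands. */
  static final int MAX_SUPPORTED_VERSION = 1;

  /** Fails if the file was written with a newer version than the reader supports. */
  static void checkVersion(int versionInFile) throws IOException {
    if (versionInFile > MAX_SUPPORTED_VERSION) {
      throw new IOException("Tracker file version " + versionInFile
          + " is newer than supported version " + MAX_SUPPORTED_VERSION
          + "; action is required before downgrading");
    }
  }
}
```

Failing loudly on a higher version is the point: a downgraded reader stops 
instead of silently misreading entries it does not understand.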



> Region split and merge time while offline is O(n) with respect to number of 
> store files
> ---------------------------------------------------------------------------------------
>
>                 Key: HBASE-27826
>                 URL: https://issues.apache.org/jira/browse/HBASE-27826
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.5.4
>            Reporter: Andrew Kyle Purtell
>            Priority: Major
>
> This is a significant availability issue when HFiles are on S3.
> HBASE-26079 ({_}Use StoreFileTracker when splitting and merging{_}) changed 
> the split and merge table procedure implementations to indirect through the 
> StoreFileTracker implementation when selecting HFiles to be merged or split, 
> rather than directly listing those using file system APIs. It also changed 
> the commit logic in HRegionFileSystem to add the link/ref files on resulting 
> split or merged regions to the StoreFileTracker. However, the creation of a 
> link file is still a filesystem operation and creating a “file” on S3 can 
> take well over a second. If, for example, there are 20 store files in a 
> region, which is not uncommon, after the region is taken offline for a split 
> (or merge) it may require more than 20 seconds to create the link files 
> before the results can be brought back online, creating a severe availability 
> problem. Splits and merges are supposed to be fast, completing in less than a 
> second, certainly less than a few seconds. This has been true when HFiles are 
> stored on HDFS only because file creation operations there are nearly 
> instantaneous. 
> There are two issues but both can be handled with modifications to the store 
> file tracker interface and the file based store file tracker implementation. 
> When the file based store file tracker is enabled, the HFile links should 
> be virtual entities that only exist in the file manifest. We do not require 
> physical files in the filesystem to serve as links now. That is the magic of 
> this file tracker: the manifest file replaces the requirement to list the 
> filesystem.
> Then, when splitting or merging, the HFile links should be collected into a 
> list and committed in one batch using a new FILE file tracker interface, 
> requiring only one update of the manifest file in S3, bringing the time 
> requirement for this operation to O(1), down from O(n).
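
To make the difference between one write per link and one batched write 
concrete, here is a rough in-memory sketch. ManifestSketch and its methods are 
hypothetical stand-ins and not the StoreFileTracker interface; the write 
counter only models how many slow object store updates each approach needs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for a store file manifest; not the HBase API. */
final class ManifestSketch {
  private final List<String> entries = new ArrayList<>();
  private int manifestWrites = 0; // each write models one slow object store update

  /** Per-link approach: every link costs its own write. */
  void addOne(String link) {
    entries.add(link);
    manifestWrites++;
  }

  /** Batched approach: all links are committed with a single write. */
  void addAll(List<String> links) {
    entries.addAll(links);
    manifestWrites++;
  }

  int writes() {
    return manifestWrites;
  }

  int size() {
    return entries.size();
  }
}
```

With the 20 store files from the description, calling addOne per link counts 
20 writes, while a single addAll counts 1 regardless of how many links are in 
the list.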



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
