Without completely opening Pandora's box, I will say we definitely have
multiple ways we can solve the metadata management for tracking (e.g. in
meta, in some other system table, in some other system, in a per-store
file). Each of them has pros and cons, and each is favored according to
whatever pain we've most recently felt as a project.
I don't want to defer having the discussion on what the "correct" one
should be, but I do want to point out that it's only half of the problem
of storefile tracking.
My hope is that we can make this tracking system pluggable, such that
we can prototype a solution that works "good enough" for now and enables
the rest of the development work to keep moving forward.
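For illustration, the "pluggable" idea might reduce to an interface
roughly like the sketch below. This is only a minimal sketch for
discussion; the interface and method names are my assumptions, not a
committed HBase API.

import java.io.IOException;
import java.util.Collection;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical plug point for storefile tracking. Implementations could keep
 * the file list in meta, in another system table, or in a per-store file.
 */
public interface StoreFileTracker {

  /** Returns the store files currently considered valid for this store. */
  Collection<Path> load() throws IOException;

  /** Records newly flushed files as committed. */
  void add(Collection<Path> newFiles) throws IOException;

  /**
   * Swaps compaction inputs for compaction outputs in one step, so readers
   * never observe a half-finished compaction.
   */
  void replace(Collection<Path> compactedFiles, Collection<Path> newFiles)
      throws IOException;
}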
I'm happy to see so many other folks also interested in the design of
how we store this.
Could I suggest we move this discussion around the metadata storage into
its own thread? If Duo doesn't already have a design doc started, I can
also try to put one together this week.
Does that work for you all?
On 5/22/21 11:02 AM, 张铎(Duo Zhang) wrote:
I could put up a simple design doc for this.

But there is still a problem: how to do a rolling upgrade. After we
change the behavior, the region server will write partial store files
directly into the data directory. For new region servers this is not a
problem, as we will read the hfile list file to find out the valid store
files.

But during a rolling upgrade we cannot upgrade all the regionservers at
once. Old regionservers will initialize a store by listing the store
files, so if a new regionserver crashes while compacting and its regions
are assigned to old regionservers, the old regionservers will be in
trouble...
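To make the hazard concrete, here is a minimal sketch of the two
initialization paths that can disagree during a rolling upgrade, written
against the Hadoop FileSystem API. The class name, method names, and the
'hfile.list' file name are assumptions for illustration.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class StoreInitPaths {

  // Old regionserver: trusts the directory listing, so after a new-style
  // regionserver crashes mid-compaction, the partial store files it left in
  // the data directory are picked up as if they were valid.
  static List<Path> initByListing(FileSystem fs, Path storeDir) throws IOException {
    List<Path> files = new ArrayList<>();
    for (FileStatus status : fs.listStatus(storeDir)) {
      files.add(status.getPath());
    }
    return files;
  }

  // New regionserver: trusts only the names recorded in the hfile list file,
  // so partial files sitting in the directory are simply ignored.
  static List<Path> initByListFile(FileSystem fs, Path storeDir) throws IOException {
    List<Path> files = new ArrayList<>();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
        fs.open(new Path(storeDir, "hfile.list")), StandardCharsets.UTF_8))) {
      for (String line = in.readLine(); line != null; line = in.readLine()) {
        if (!line.isEmpty()) {
          files.add(new Path(storeDir, line));
        }
      }
    }
    return files;
  }
}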
Stack <[email protected]> 于2021年5月22日周六 下午12:14写道:
HBASE-24749 design and implementation had acknowledged compromises on
review: e.g. adding a new 'system table' to hold store files. I'd suggest
the design and implementation need a revisit before we go forward; for
instance, factoring for systems other than S3 as suggested above (I like
the Duo list).
S
On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) <[email protected]> wrote:
What about just storing the hfile list in a file? Since S3 now has strong
consistency, we could safely overwrite a file, I think. And since the
hfile list file will be very small, renaming will not be a big problem.

We could write the hfile list to a file called 'hfile.list.tmp', and then
rename it to 'hfile.list'. This is safe for HDFS; for S3, since the rename
is not atomic, we could face the case where the 'hfile.list' file is not
there but a 'hfile.list.tmp' is.

So when opening an HStore, we first check whether 'hfile.list' is there;
if not, we try 'hfile.list.tmp', rename it, and load it. For safety, we
could write an initial hfile list file with no hfiles, so if we can load
neither 'hfile.list' nor 'hfile.list.tmp', we know something is wrong and
users should try to fix it with HBCK.

And in HBCK, we will do a listing and generate the 'hfile.list' file.
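A minimal sketch of that write/rename/recover protocol, assuming a
plain-text file format and illustrative names (a real implementation
would likely use a sturdier serialization and proper fencing):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class HFileListFile {
  static final String LIST = "hfile.list";
  static final String TMP = "hfile.list.tmp";

  /** Commit a new hfile list: write the tmp file, then rename it over the final name. */
  static void commit(FileSystem fs, Path storeDir, List<String> hfiles) throws IOException {
    Path tmp = new Path(storeDir, TMP);
    try (FSDataOutputStream out = fs.create(tmp, true)) {
      for (String hfile : hfiles) {
        out.write((hfile + "\n").getBytes(StandardCharsets.UTF_8));
      }
    }
    // Hadoop's rename does not overwrite an existing destination, so drop the
    // old list first; the tmp file still carries the full state if we crash here.
    fs.delete(new Path(storeDir, LIST), false);
    if (!fs.rename(tmp, new Path(storeDir, LIST))) {
      throw new IOException("Failed to rename " + tmp);
    }
  }

  /** Load the list, finishing an interrupted commit if only the tmp file survived. */
  static List<String> load(FileSystem fs, Path storeDir) throws IOException {
    Path list = new Path(storeDir, LIST);
    Path tmp = new Path(storeDir, TMP);
    if (!fs.exists(list)) {
      if (!fs.exists(tmp)) {
        // Every store writes an initial (empty) list at creation, so finding
        // neither file means something is wrong: ask the user to run HBCK.
        throw new IOException("Neither " + LIST + " nor " + TMP + " under " + storeDir);
      }
      if (!fs.rename(tmp, list)) { // complete the interrupted commit
        throw new IOException("Failed to recover " + tmp);
      }
    }
    List<String> hfiles = new ArrayList<>();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(list), StandardCharsets.UTF_8))) {
      for (String line = in.readLine(); line != null; line = in.readLine()) {
        if (!line.isEmpty()) {
          hfiles.add(line);
        }
      }
    }
    return hfiles;
  }
}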
WDYT?
Thanks.
Wellington Chevreuil <[email protected]> wrote on Wed, May 19, 2021 at 10:43 PM:
Thank you, Andrew and Duo.

Talking internally with Josh Elser, the initial idea was to rebase the
feature branch on master (in order to catch up with the latest commits),
then focus on work to have a minimally functioning hbase; in other words,
together with the already committed work from HBASE-25391, make sure
flushes, compactions, splits and merges can all take advantage of the
persistent store file manager and complete with no need to rely on
renames. These all map to the subtasks HBASE-25391, HBASE-25392 and
HBASE-25393. Once we can test and validate that this works well for our
goals, we can then focus on snapshots, bulk loading and tooling.
S3 now supports strong consistency, and I heard that they are also
implementing atomic renaming currently, so maybe that's one of the
reasons why the development is silent now..

Interesting, I had no idea this was being implemented. I know, however,
that a version of this feature is already available on the latest EMR
releases (at least from 6.2.0), and the AWS team has published their own
blog post with their results:
https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
But I do not think storing the hfile list in meta is the only solution.
It will cause cyclic dependencies for hbase:meta, and then force us to
have a fallback solution, which makes the code a bit ugly. We should try
to see if this could be done with only the FileSystem.

This is indeed a relevant concern. One idea I had mentioned in the
original design doc was to track committed/non-committed files through
xattrs (or tags), which may have their own performance issues, as
explained by Stephen Wu, but it is something that could be attempted.
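To make the xattr idea concrete, a minimal sketch against the Hadoop
FileSystem xattr API is below. The attribute name is my assumption, and
not every FileSystem implementation supports xattrs (S3A in particular
may not), so treat this as an illustration of the idea rather than a
workable design for S3.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class XAttrCommitMarker {
  // Hypothetical attribute name; "user." is the namespace required by Hadoop.
  static final String COMMITTED = "user.hbase.storefile.committed";

  /** Mark an hfile as committed without renaming it. */
  static void markCommitted(FileSystem fs, Path hfile) throws IOException {
    fs.setXAttr(hfile, COMMITTED, "true".getBytes(StandardCharsets.UTF_8));
  }

  /** An hfile is considered valid only if the committed marker is present. */
  static boolean isCommitted(FileSystem fs, Path hfile) {
    try {
      byte[] value = fs.getXAttr(hfile, COMMITTED);
      return value != null
          && "true".equals(new String(value, StandardCharsets.UTF_8));
    } catch (IOException e) {
      // Attribute missing or xattrs unsupported: treat as not committed.
      return false;
    }
  }
}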
On Wed, May 19, 2021 at 4:56 AM, 张铎(Duo Zhang) <[email protected]> wrote:
S3 now supports strong consistency, and I heard that they are also
implementing atomic renaming currently, so maybe that's one of the
reasons why the development is silent now...

For me, I also think deploying hbase on cloud storage is the future, so I
would also like to participate here.

But I do not think storing the hfile list in meta is the only solution.
It will cause cyclic dependencies for hbase:meta, and then force us to
have a fallback solution, which makes the code a bit ugly. We should try
to see if this could be done with only the FileSystem.

Thanks.
Andrew Purtell <[email protected]> wrote on Wed, May 19, 2021 at 8:04 AM:
Wellington (et al.),

S3 is also an important piece of our future production plans.
Unfortunately, we were unable to assist much with last year's work, on
account of being sidetracked by more immediate concerns. Fortunately,
this renewed interest is timely, in that we have an HBase 2 project
where, if this can land in a 2.5 or a 2.6, it could be an important
cost-to-serve optimization, and one we could and would make use of.
Therefore I would like to restate my employer's interest in this work
too. It may just be Viraj and myself in the early days.

I'm not sure how best to collaborate. We could review changes from the
original authors, review new changes, and/or divide up the development
tasks. We can certainly offer our time for testing, and can afford the
costs of testing against the S3 service.
On Tue, May 18, 2021 at 12:16 PM Wellington Chevreuil <[email protected]> wrote:
Greetings everyone,

HBASE-24749 was proposed almost a year ago, introducing a new StoreFile
tracker as a way to allow any hbase hfile modifications to be safely
completed without needing a file system rename. This seems pretty
relevant for deployments over S3 file systems, where rename operations
are not atomic and can suffer performance degradation when multiple
requests are concurrently submitted to the same bucket. We had done
superficial tests and ycsb runs, where individual renames of files larger
than 5GB can take several hundred seconds to complete. We also observed
impacts on write load throughput, the bottleneck potentially being the
renames.

With S3 being an important piece of my employer's cloud solution, we
would like to help it move forward. We plan to contribute new patches per
the original design/Jira, but we'd also be happy to review changes from
the original authors, too. Please let us know if anyone has any concerns;
otherwise we'll start to self-assign issues on HBASE-24749.

Wellington
--
Best regards,
Andrew
Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk