Just go ahead, Josh, I haven't started writing the design doc yet. Thank you for your help!
Josh Elser <[email protected]> 于2021年5月25日周二 上午1:45写道: > Without completely opening Pandora's box, I will say we definitely have > multiple ways we can solve the metadata management for tracking (e.g. in > meta, in some other system table, in some other system, in a per-store > file). Each of them have pro's and con's, and each of them has "favor" > as to what pain we've most recently felt as a project. > > I don't want to defer having the discussion on what the "correct" one > should be, but I do want to point out that it's only half of the problem > of storefile tracking. > > My hope is that we can make this tracking system be pluggable, such that > we can prototype a solution that works "good enough" for now and enables > the rest of the development work to keep moving forward. > > I'm happy to see so many other folks also interested in the design of > how we store this. > > Could I suggest we move this discussion around the metadata storage into > its own thread? If Duo doesn't already have a design doc started, I can > also try to put one together this week. > > Does that work for you all? > > On 5/22/21 11:02 AM, 张铎(Duo Zhang) wrote: > > I could put up a simple design doc for this. > > > > But there is still a problem, about how to do rolling upgrading. > > > > After we changed the behavior, the region server will write partial store > > files directly into the data directory. For new region servers, this is > not > > a problem, as we will read the hfilelist file to find out the valid store > > files. > > But when rolling upgrading, we can not upgrade all the regionservers at > > once, for old regionservers, they will initialize a store by listing the > > store files, so if a new regionserver crashes when compacting and its > > regions are assigned to old regionservers, the old regionservers will be > in > > trouble... > > > > Stack <[email protected]> 于2021年5月22日周六 下午12:14写道: > > > >> HBASE-24749 design and implementation had acknowledged compromises on > >> review: e.g. adding a new 'system table' to hold store files. I'd > suggest > >> the design and implementation need a revisit before we go forward; for > >> instance, factoring for systems other than s3 as suggested above (I like > >> the Duo list). > >> > >> S > >> > >> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) <[email protected]> > >> wrote: > >> > >>> What about just storing the hfile list in a file? Since now S3 has > strong > >>> consistency, we could safely overwrite a file then I think? > >>> > >>> And since the hfile list file will be very small, renaming will not be > a > >>> big problem. > >>> > >>> We could write the hfile list to a file called 'hfile.list.tmp', and > then > >>> rename it to 'hfile.list'. > >>> > >>> This is safe for HDFS, and for S3, since it is not atomic, maybe we > could > >>> face that, the 'hfile.list' file is not there, but there is a > >>> 'hfile.list.tmp'. > >>> > >>> So when opening a HStore, we first check if 'hfile.list' is there, if > >> not, > >>> try 'hfile.list.tmp', rename it and load it. For safety, we could write > >> an > >>> initial hfile list file with no hfiles. So if we can not load either > >>> 'hfile.list' or 'hfile.list.tmp', then we know something is wrong so > >> users > >>> should try to fix it with HBCK. > >>> And in HBCK, we will do a listing and generate the 'hfile.list' file. > >>> > >>> WDYT? > >>> > >>> Thanks. 
> >>>
> >>> On Wed, May 19, 2021 at 10:43 PM, Wellington Chevreuil
> >>> <[email protected]> wrote:
> >>>
> >>>> Thank you, Andrew and Duo,
> >>>>
> >>>> Talking internally with Josh Elser, the initial idea was to rebase
> >>>> the feature branch onto master (in order to catch up with the latest
> >>>> commits), then focus on work to have a minimally functioning hbase;
> >>>> in other words, together with the already committed work from
> >>>> HBASE-25391, make sure flush, compactions, splits and merges can all
> >>>> take advantage of the persistent store file manager and complete
> >>>> with no need to rely on renames. These all map to the subtasks
> >>>> HBASE-25391, HBASE-25392 and HBASE-25393. Once we can test and
> >>>> validate that this works well for our goals, we can then focus on
> >>>> snapshots, bulkloading and tooling.
> >>>>
> >>>>> S3 now supports strong consistency, and I heard that they are also
> >>>>> implementing atomic renaming currently, so maybe that's one of the
> >>>>> reasons why the development is silent now...
> >>>>>
> >>>> Interesting, I had no idea this was being implemented. I know,
> >>>> however, that a version of this feature is already available on the
> >>>> latest EMR releases (at least from 6.2.0), and the AWS team has
> >>>> published their own blog post with their results:
> >>>>
> >>>> https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
> >>>>
> >>>>> But I do not think storing the hfile list in meta is the only
> >>>>> solution. It will cause cyclic dependencies for hbase:meta, and
> >>>>> then force us to have a fallback solution, which makes the code a
> >>>>> bit ugly. We should try to see if this could be done with only the
> >>>>> FileSystem.
> >>>>>
> >>>> This is indeed a relevant concern. One idea I had mentioned in the
> >>>> original design doc was to track committed/non-committed files
> >>>> through xattr (or tags), which may have its own performance issues
> >>>> as explained by Stephen Wu, but is something that could be
> >>>> attempted.
> >>>>
> >>>> On Wed, May 19, 2021 at 04:56, 张铎(Duo Zhang) <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> S3 now supports strong consistency, and I heard that they are also
> >>>>> implementing atomic renaming currently, so maybe that's one of the
> >>>>> reasons why the development is silent now...
> >>>>>
> >>>>> For me, I also think deploying hbase on cloud storage is the
> >>>>> future, so I would also like to participate here.
> >>>>>
> >>>>> But I do not think storing the hfile list in meta is the only
> >>>>> solution. It will cause cyclic dependencies for hbase:meta, and
> >>>>> then force us to have a fallback solution, which makes the code a
> >>>>> bit ugly. We should try to see if this could be done with only the
> >>>>> FileSystem.
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>> On Wed, May 19, 2021 at 8:04 AM, Andrew Purtell
> >>>>> <[email protected]> wrote:
> >>>>>
> >>>>>> Wellington (et al.),
> >>>>>>
> >>>>>> S3 is also an important piece of our future production plans.
> >>>>>> Unfortunately, we were unable to assist much with last year's
> >>>>>> work, on account of being sidetracked by more immediate concerns.
> >>>>>> Fortunately, this renewed interest is timely in that we have an
> >>>>>> HBase 2 project where, if this can land in a 2.5 or a 2.6, it
> >>>>>> could be an important cost-to-serve optimization, and one we could
> >>>>>> and would make use of. Therefore I would like to restate my
> >>>>>> employer's interest in this work too. It may just be Viraj and
> >>>>>> myself in the early days.
> >>>>>>
> >>>>>> I'm not sure how best to collaborate. We could review changes from
> >>>>>> the original authors, contribute new changes, and/or divide up the
> >>>>>> development tasks. We can certainly offer our time for testing,
> >>>>>> and can afford the costs of testing against the S3 service.
> >>>>>>
> >>>>>> On Tue, May 18, 2021 at 12:16 PM Wellington Chevreuil
> >>>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>>> Greetings everyone,
> >>>>>>>
> >>>>>>> HBASE-24749 was proposed almost a year ago, introducing a new
> >>>>>>> StoreFile tracker as a way to allow any hbase hfile modifications
> >>>>>>> to be safely completed without needing a file system rename. This
> >>>>>>> seems pretty relevant for deployments over S3 file systems, where
> >>>>>>> rename operations are not atomic and can degrade performance when
> >>>>>>> multiple requests are submitted concurrently to the same bucket.
> >>>>>>> We have done superficial tests and ycsb runs, where individual
> >>>>>>> renames of files larger than 5GB can take a few hundred seconds
> >>>>>>> to complete. We also observed impacts on write load throughput,
> >>>>>>> with the renames potentially being the bottleneck.
> >>>>>>>
> >>>>>>> With S3 being an important piece of my employer's cloud solution,
> >>>>>>> we would like to help this work move forward. We plan to
> >>>>>>> contribute new patches per the original design/Jira, but we'd
> >>>>>>> also be happy to review changes from the original authors, too.
> >>>>>>> Please let us know if anyone has any concerns; otherwise we'll
> >>>>>>> start to self-assign issues on HBASE-24749.
> >>>>>>>
> >>>>>>> Wellington
> >>>>>>
> >>>>>> --
> >>>>>> Best regards,
> >>>>>> Andrew
> >>>>>>
> >>>>>> Words like orphans lost among the crosstalk, meaning torn from
> >>>>>> truth's decrepit hands
> >>>>>>    - A23, Crosstalk
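To make Josh's pluggability point above concrete, here is one hypothetical shape such a seam could take. The StoreFileTracker name and its methods are assumptions for illustration, not the API the project eventually committed:

import java.io.IOException;
import java.util.Collection;

/**
 * One possible seam for pluggable store file tracking. Implementations
 * could be backed by hbase:meta, a separate system table, an external
 * system, or a per-store file such as the 'hfile.list' proposal above.
 */
public interface StoreFileTracker {

  /** Returns the currently committed store files for this store. */
  Collection<String> load() throws IOException;

  /** Records newly committed files, e.g. after a flush. */
  void add(Collection<String> newFiles) throws IOException;

  /**
   * Atomically replaces compaction inputs with compaction outputs, so a
   * crash mid-compaction never exposes a partial file set.
   */
  void replace(Collection<String> compactedFiles,
      Collection<String> newFiles) throws IOException;
}

Flush, compaction, split, and merge would each commit through this seam, which is what would let a "good enough" file-based prototype be swapped later for a meta- or table-backed tracker without touching the rest of the write path.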
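Wellington's idea of distinguishing committed from non-committed files via xattr (or tags) could look roughly like the following, written against the Hadoop FileSystem xattr API. Whether a given S3 connector supports setXAttr is an open question, and the attribute name is a made-up placeholder, so treat this purely as an illustration of the idea:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative commit marker via xattrs; all names are assumptions. */
public final class CommitMarker {

  // HDFS xattr names must carry a namespace prefix such as "user.".
  private static final String COMMITTED = "user.hbase.committed";

  private CommitMarker() {
  }

  /** Flags an hfile as committed once it is fully written. */
  static void markCommitted(FileSystem fs, Path hfile) throws IOException {
    fs.setXAttr(hfile, COMMITTED, new byte[] { 1 });
  }

  /** A file without the marker is a leftover from a failed operation. */
  static boolean isCommitted(FileSystem fs, Path hfile) throws IOException {
    return fs.getXAttrs(hfile).containsKey(COMMITTED);
  }
}

As noted in the thread, the extra per-file metadata round trips are the likely performance concern with this approach.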
