Re: [DISCUSS] Implement and release HBASE-24749 (an hfile tracker that allows for avoiding renames)

Andrew Purtell Mon, 24 May 2021 09:24:39 -0700

> And for downgrading, usually we do not support downgrading from a major
> version upgrading, so it is not a big problem.


You missed an earlier comment from me.

Our team requires this to be released in a branch-2 version or we can't use
it. Therefore I am not in favor of any solution that requires a major
version increment.


On Sun, May 23, 2021 at 5:43 AM 张铎(Duo Zhang) <[email protected]> wrote:

> I do not think it should be a table level config. It should be a cluster
> level config. We only have one FileSystem so it is useless to let different
> tables have different ways to store hfile list.
>
> But I think the general approach is fine. We could introduce a config for
> whether to enable 'write to data directory directly' mode. When rolling
> upgrading, the flag should be false, you can change it to true after the
> whole cluster has been upgraded.
>
> And for downgrading, usually we do not support downgrading from a major
> version upgrading, so it is not a big problem.
>
> Thanks.
>
> Andrew Purtell <[email protected]> wrote on Sun, May 23, 2021 at 12:53 AM:
>
> > Put a check in the code for whether hfilelist mode or the original store
> > layout is in use, and handle both cases. Then, to upgrade:
> >
> > 1. First, perform a rolling upgrade to $NEW_VERSION .
> >
> > 2. Once upgraded to $NEW_VERSION, execute an alter table command that
> > enables hfilelist mode. This will cause all regions to close and reopen
> > in the new mode.
> >
> > Because the rolling upgrade to $NEW_VERSION is completed first a mix of
> > old and new layouts is fine, for the brief period of time when store
> > layouts are upgrading in response to the alter command, because this
> > version can handle both.
> >
> > Downgrade to an older version is not possible after the alter table
> > command, so this must be clearly documented, but of course would not be a
> > surprise to anyone, because the alter command is for switching to the new
> > store layout.
> >
> >
> > > On May 22, 2021, at 8:03 AM, 张铎 <[email protected]> wrote:
> > >
> > > I could put up a simple design doc for this.
> > >
> > > But there is still a problem, about how to do rolling upgrading.
> > >
> > > After we change the behavior, the region server will write partial
> > > store files directly into the data directory. For new region servers,
> > > this is not a problem, as we will read the hfilelist file to find out
> > > the valid store files.
> > > But during a rolling upgrade, we cannot upgrade all the regionservers
> > > at once. Old regionservers initialize a store by listing the store
> > > files, so if a new regionserver crashes while compacting and its
> > > regions are assigned to old regionservers, the old regionservers will
> > > be in trouble...
> > >
> > > Stack <[email protected]> wrote on Sat, May 22, 2021 at 12:14 PM:
> > >
> > >> HBASE-24749 design and implementation had acknowledged compromises on
> > >> review: e.g. adding a new 'system table' to hold store files. I'd
> > >> suggest the design and implementation need a revisit before we go
> > >> forward; for instance, factoring for systems other than s3 as suggested
> > >> above (I like the Duo list).
> > >>
> > >> S
> > >>
> > >>> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) <[email protected]> wrote:
> > >>>
> > >>> What about just storing the hfile list in a file? Since S3 now has
> > >>> strong consistency, I think we could safely overwrite a file.
> > >>>
> > >>> And since the hfile list file will be very small, renaming will not
> > >>> be a big problem.
> > >>>
> > >>> We could write the hfile list to a file called 'hfile.list.tmp', and
> > >>> then rename it to 'hfile.list'.
> > >>>
> > >>> This is safe for HDFS. For S3, since the rename is not atomic, we
> > >>> could end up in a state where the 'hfile.list' file is not there but
> > >>> there is a 'hfile.list.tmp'.
> > >>>
> > >>> So when opening an HStore, we first check whether 'hfile.list' is
> > >>> there; if not, we try 'hfile.list.tmp', rename it, and load it. For
> > >>> safety, we could write an initial hfile list file with no hfiles. So
> > >>> if we cannot load either 'hfile.list' or 'hfile.list.tmp', then we
> > >>> know something is wrong and users should try to fix it with HBCK.
> > >>> And in HBCK, we will do a listing and generate the 'hfile.list' file.
> > >>>
> > >>> WDYT?
> > >>>
> > >>> Thanks.
> > >>>
> > >>> Wellington Chevreuil <[email protected]> wrote on Wed, May 19, 2021 at 10:43 PM:
> > >>>
> > >>>> Thank you, Andrew and Duo,
> > >>>>
> > >>>> Talking internally with Josh Elser, the initial idea was to rebase the
> > >>>> feature branch on master (in order to catch up with the latest
> > >>>> commits), then focus on the work needed for a minimal functioning
> > >>>> hbase. In other words, together with the already committed work from
> > >>>> HBASE-25391, make sure flushes, compactions, splits and merges can all
> > >>>> take advantage of the persistent store file manager and complete with
> > >>>> no need to rely on renames. These all map to the subtasks HBASE-25391,
> > >>>> HBASE-25392 and HBASE-25393. Once we can test and validate that this
> > >>>> works well for our goals, we can then focus on snapshots, bulkloading
> > >>>> and tooling.
> > >>>>
> > >>>>> S3 now supports strong consistency, and I heard that they are also
> > >>>>> implementing atomic renaming currently, so maybe that's one of the
> > >>>>> reasons why the development is silent now..
> > >>>>>
> > >>>> Interesting, I had no idea this was being implemented. I know,
> > >>>> however, that a version of this feature is already available on the
> > >>>> latest EMR releases (at least from 6.2.0), and the AWS team has
> > >>>> published their own blog post with their results:
> > >>>>
> > >>>> https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
> > >>>>
> > >>>>> But I do not think storing the hfile list in meta is the only
> > >>>>> solution. It will cause cyclic dependencies for hbase:meta, and then
> > >>>>> force us to have a fallback solution which makes the code a bit ugly.
> > >>>>> We should try to see if this could be done with only the FileSystem.
> > >>>>>
> > >>>> This is indeed a relevant concern. One idea I had mentioned in the
> > >>>> original design doc was to track committed/non-committed files through
> > >>>> xattr (or tags), which may have its own performance issues as
> > >>>> explained by Stephen Wu, but is something that could be attempted.
> > >>>>
> > >>>> On Wed, May 19, 2021 at 04:56, 张铎(Duo Zhang) <[email protected]> wrote:
> > >>>>
> > >>>>> S3 now supports strong consistency, and I heard that they are also
> > >>>>> implementing atomic renaming currently, so maybe that's one of the
> > >>>>> reasons why the development is silent now...
> > >>>>>
> > >>>>> For me, I also think deploying hbase on cloud storage is the future,
> > >>>>> so I would also like to participate here.
> > >>>>>
> > >>>>> But I do not think storing the hfile list in meta is the only
> > >>>>> solution. It will cause cyclic dependencies for hbase:meta, and then
> > >>>>> force us to have a fallback solution which makes the code a bit ugly.
> > >>>>> We should try to see if this could be done with only the FileSystem.
> > >>>>>
> > >>>>> Thanks.
> > >>>>>
> > >>>>> Andrew Purtell <[email protected]> wrote on Wed, May 19, 2021 at 8:04 AM:
> > >>>>>
> > >>>>>> Wellington (et al.),
> > >>>>>>
> > >>>>>> S3 is also an important piece of our future production plans.
> > >>>>>> Unfortunately, we were unable to assist much with last year's work,
> > >>>>>> on account of being sidetracked by more immediate concerns.
> > >>>>>> Fortunately, this renewed interest is timely in that we have an
> > >>>>>> HBase 2 project where, if this can land in a 2.5 or a 2.6, it could
> > >>>>>> be an important cost-to-serve optimization, and one we could and
> > >>>>>> would make use of. Therefore I would like to restate my employer's
> > >>>>>> interest in this work too. It may just be Viraj and myself in the
> > >>>>>> early days.
> > >>>>>>
> > >>>>>> I'm not sure how best to collaborate. We could review changes from
> > >>>>>> the original authors, new changes, and/or divide up the development
> > >>>>>> tasks. We can certainly offer our time for testing, and can afford
> > >>>>>> the costs of testing against the S3 service.
> > >>>>>>
> > >>>>>>
> > >>>>>> On Tue, May 18, 2021 at 12:16 PM Wellington Chevreuil <
> > >>>>>> [email protected]> wrote:
> > >>>>>>
> > >>>>>>> Greetings everyone,
> > >>>>>>>
> > >>>>>>> HBASE-24749 was proposed almost a year ago, introducing a new
> > >>>>>>> StoreFile tracker as a way to allow any hbase hfile modification
> > >>>>>>> to be safely completed without needing a file system rename. This
> > >>>>>>> seems pretty relevant for deployments over S3 file systems, where
> > >>>>>>> rename operations are not atomic and can suffer performance
> > >>>>>>> degradation when multiple requests are submitted concurrently to
> > >>>>>>> the same bucket. We had done superficial tests and ycsb runs,
> > >>>>>>> where individual renames of files larger than 5GB can take a few
> > >>>>>>> hundred seconds to complete. We also observed impacts on write
> > >>>>>>> load throughput, the bottleneck potentially being the renames.
> > >>>>>>>
> > >>>>>>> With S3 being an important piece of my employer's cloud solution,
> > >>>>>>> we would like to help it move forward. We plan to contribute new
> > >>>>>>> patches per the original design/Jira, but we’d also be happy to
> > >>>>>>> review changes from the original authors, too. Please let us know
> > >>>>>>> if anyone has any concerns, otherwise we’ll start to self-assign
> > >>>>>>> issues on HBASE-24749.
> > >>>>>>>
> > >>>>>>> Wellington
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> --
> > >>>>>> Best regards,
> > >>>>>> Andrew
> > >>>>>>
> > >>>>>> Words like orphans lost among the crosstalk, meaning torn from
> > >>>>>> truth's decrepit hands
> > >>>>>>   - A23, Crosstalk
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
>


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk
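
Editorial sketch (not part of the original messages): the 'hfile.list' /
'hfile.list.tmp' sequence that Duo Zhang describes in the quoted thread can
be illustrated roughly as below. This is only a minimal model on a local
filesystem, not HBase code. The JSON file format, the function names, and
the StoreFileListMissing exception are all assumptions made for the
illustration; the thread does not specify any of them.

```python
import json
import os
from pathlib import Path
from typing import List

LIST_NAME = "hfile.list"      # file name taken from the thread
TMP_NAME = "hfile.list.tmp"   # file name taken from the thread


class StoreFileListMissing(Exception):
    """Hypothetical error: neither list file exists, so a repair is needed."""


def write_hfile_list(store_dir: Path, hfiles: List[str]) -> None:
    # Write the complete list to the temp file, then rename it into place.
    # os.replace is atomic on a local POSIX filesystem; the thread notes
    # that the equivalent rename on S3 is not atomic.
    tmp = store_dir / TMP_NAME
    tmp.write_text(json.dumps(sorted(hfiles)))  # JSON format is an assumption
    os.replace(tmp, store_dir / LIST_NAME)


def load_hfile_list(store_dir: Path) -> List[str]:
    # Prefer 'hfile.list'. If it is absent, fall back to 'hfile.list.tmp',
    # finish the interrupted rename, and load the result.
    final = store_dir / LIST_NAME
    tmp = store_dir / TMP_NAME
    if not final.exists():
        if not tmp.exists():
            # An initial empty list is always written, so reaching this
            # point means something is wrong and a repair tool should run.
            raise StoreFileListMissing(str(store_dir))
        os.replace(tmp, final)
    return json.loads(final.read_text())


def rebuild_hfile_list(store_dir: Path) -> List[str]:
    # Repair path (the role the thread gives to HBCK): list the directory
    # and regenerate 'hfile.list' from whatever files are present.
    names = [p.name for p in store_dir.iterdir()
             if p.name not in (LIST_NAME, TMP_NAME)]
    write_hfile_list(store_dir, names)
    return sorted(names)
```

In this model a store would be created with write_hfile_list(store_dir, []),
so that a later load failing on both file names signals damage rather than
an empty store.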
