Since we currently just use the general FileSystem API to do listing, is it possible to make use of 'bucket index listing'?
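
To make that concrete, here is a rough sketch of what I have in mind using nothing but the generic Hadoop FileSystem API, building on Andrew's timestamped-manifest suggestion below. To be clear, this is only an illustration for discussion: the manifest.<timestamp> naming, the one-store-file-name-per-line format, and all class/method names are placeholders I made up, not the actual HBASE-24749 design.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch only (names are placeholders): track the committed store files of one
 * store in timestamped manifest files, so that commits never need rename or
 * overwrite, and readers find the current manifest with a plain listing.
 */
public class StoreFileManifestSketch {

  // Hypothetical naming scheme: manifest.<epoch-millis>. Epoch millis are
  // fixed-width for the foreseeable future, so lexicographic order of the
  // names matches chronological order.
  private static final String MANIFEST_PREFIX = "manifest.";

  /** List the store directory and return the newest manifest, or null if none. */
  static Path latestManifest(FileSystem fs, Path storeDir) throws IOException {
    FileStatus[] candidates =
        fs.listStatus(storeDir, p -> p.getName().startsWith(MANIFEST_PREFIX));
    return Arrays.stream(candidates)
        .map(FileStatus::getPath)
        .max(Comparator.comparing(Path::getName))
        .orElse(null);
  }

  /** Read the committed store file names from a manifest (one name per line). */
  static List<String> readManifest(FileSystem fs, Path manifest) throws IOException {
    List<String> files = new ArrayList<>();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(manifest), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        files.add(line);
      }
    }
    return files;
  }

  /**
   * Commit a new list of store files (e.g. after a flush or compaction) by
   * writing a brand new manifest with a newer timestamp. The old manifest is
   * left in place and can be deleted lazily; readers always pick the newest.
   */
  static Path commitManifest(FileSystem fs, Path storeDir, List<String> storeFiles)
      throws IOException {
    Path next = new Path(storeDir, MANIFEST_PREFIX + System.currentTimeMillis());
    // overwrite=false: we never clobber an existing manifest, we only add new ones
    try (FSDataOutputStream out = fs.create(next, false)) {
      for (String storeFile : storeFiles) {
        out.write((storeFile + "\n").getBytes(StandardCharsets.UTF_8));
      }
    }
    return next;
  }
}

The appeal is that readers never depend on rename or overwrite; they only need object listing plus read-after-write visibility of newly created objects (the guarantees quoted from the vendors below), and stale manifests can be cleaned up lazily as Andrew described.
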
Andrew Purtell <[email protected]> 于2021年5月22日周六 上午6:34写道: > > > > On May 20, 2021, at 4:00 AM, Wellington Chevreuil < > [email protected]> wrote: > > > > > >> > >> > >> IMO it should be a file per store. > >> Per region is not suitable here as compaction is per store. > >> Per file means we still need to list all the files. And usually, after > >> compaction, we need to do an atomic operation to remove several old > files > >> and add a new file, or even several files for stripe compaction. It > will be > >> easy if we just write one file to commit these changes. > >> > > > > Fine for me if it's simpler. Mentioned the per file approach because I > > thought it could be easier/faster to do that, rather than having to > update > > the store file list on every flush. AFAIK, append is out of the table, so > > updating this file would mean read it, write original content plus new > > hfile to a temp file, delete original file, rename it). > > > > That sounds right to be. > > A minor potential optimization is the filename could have a timestamp > component, so a bucket index listing at that path would pick up a list > including the latest, and the latest would be used as the manifest of valid > store files. The cloud object store is expected to provide an atomic > listing semantic where the file is written and closed and only then is it > visible, and it is visible at once to everyone. (I think this is available > on most.) Old manifest file versions could be lazily deleted. > > > >> Em qui., 20 de mai. de 2021 às 02:57, 张铎(Duo Zhang) < > [email protected]> > >> escreveu: > >> > >> IIRC S3 is the only object storage which does not guarantee > >> read-after-write consistency in the past... > >> > >> This is the quick result after googling > >> > >> AWS [1] > >> > >>> Amazon S3 delivers strong read-after-write consistency automatically > for > >>> all applications > >> > >> > >> Azure[2] > >> > >>> Azure Storage was designed to embrace a strong consistency model that > >>> guarantees that after the service performs an insert or update > operation, > >>> subsequent read operations return the latest update. > >> > >> > >> Aliyun[3] > >> > >>> A feature requires that object operations in OSS be atomic, which > >>> indicates that operations can only either succeed or fail without > >>> intermediate states. To ensure that users can access only complete > data, > >>> OSS does not return corrupted or partial data. > >>> > >>> Object operations in OSS are highly consistent. For example, when a > user > >>> receives an upload (PUT) success response, the uploaded object can be > >> read > >>> immediately, and copies of the object are written to multiple devices > for > >>> redundancy. Therefore, the situations where data is not obtained when > you > >>> perform the read-after-write operation do not exist. The same is true > for > >>> delete operations. After you delete an object, the object and its > copies > >> no > >>> longer exist. > >>> > >> > >> GCP[4] > >> > >>> Cloud Storage provides strong global consistency for the following > >>> operations, including both data and metadata: > >>> > >>> Read-after-write > >>> Read-after-metadata-update > >>> Read-after-delete > >>> Bucket listing > >>> Object listing > >>> > >> > >> I think these vendors could cover most end users in the world? > >> > >> 1. https://aws.amazon.com/cn/s3/consistency/ > >> 2. > >> > >> > https://docs.microsoft.com/en-us/azure/storage/blobs/concurrency-manage?tabs=dotnet > >> 3. 
https://www.alibabacloud.com/help/doc-detail/31827.htm > >> 4. https://cloud.google.com/storage/docs/consistency > >> > >> Nick Dimiduk <[email protected]> 于2021年5月19日周三 下午11:40写道: > >> > >>> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) <[email protected]> > >>> wrote: > >>> > >>>> What about just storing the hfile list in a file? Since now S3 has > >> strong > >>>> consistency, we could safely overwrite a file then I think? > >>>> > >>> > >>> My concern is about portability. S3 isn't the only blob store in town, > >> and > >>> consistent read-what-you-wrote semantics are not a standard feature, as > >> far > >>> as I know. If we want something that can work on 3 or 5 major public > >> cloud > >>> blobstore products as well as a smattering of on-prem technologies, we > >>> should be selective about what features we choose to rely on as > >>> foundational to our implementation. > >>> > >>> Or we are explicitly saying this will only work on S3 and we'll only > >>> support other services when they can achieve this level of > compatibility. > >>> > >>> Either way, we should be clear and up-front about what semantics we > >> demand. > >>> Implementing some kind of a test harness that can check compatibility > >> would > >>> help here, a similar effort to that of defining standard behaviors of > >> HDFS > >>> implementations. > >>> > >>> I love this discussion :) > >>> > >>> And since the hfile list file will be very small, renaming will not be > a > >>>> big problem. > >>>> > >>> > >>> Would this be a file per store? A file per region? Ah. Below you imply > >> it's > >>> per store. > >>> > >>> Wellington Chevreuil <[email protected]> 于2021年5月19日周三 > >>>> 下午10:43写道: > >>>> > >>>>> Thank you, Andrew and Duo, > >>>>> > >>>>> Talking internally with Josh Elser, initial idea was to rebase the > >>>> feature > >>>>> branch with master (in order to catch with latest commits), then > >> focus > >>> on > >>>>> work to have a minimal functioning hbase, in other words, together > >> with > >>>> the > >>>>> already committed work from HBASE-25391, make sure flush, > >> compactions, > >>>>> splits and merges all can take advantage of the persistent store file > >>>>> manager and complete with no need to rely on renames. These all map > >> to > >>>> the > >>>>> substasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we could > >> test > >>>> and > >>>>> validate this works well for our goals, we can then focus on > >> snapshots, > >>>>> bulkloading and tooling. > >>>>> > >>>>> S3 now supports strong consistency, and I heard that they are also > >>>>>> implementing atomic renaming currently, so maybe that's one of the > >>>>> reasons > >>>>>> why the development is silent now.. > >>>>>> > >>>>> Interesting, I had no idea this was being implemented. I know, > >>> however, a > >>>>> version of this feature is already available on latest EMR releases > >> (at > >>>>> least from 6.2.0), and AWS team has published their own blog post > >> with > >>>>> their results: > >>>>> > >>>>> > >>>> > >>> > >> > https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/ > >>>>> > >>>>> But I do not think store hfile list in meta is the only solution. It > >>> will > >>>>>> cause cyclic dependencies for hbase:meta, and then force us a have > >> a > >>>>>> fallback solution which makes the code a bit ugly. We should try to > >>> see > >>>>> if > >>>>>> this could be done with only the FileSystem. > >>>>>> > >>>>> This is indeed a relevant concern. 
One idea I had mentioned in the > >>>> original > >>>>> design doc was to track committed/non-committed files through xattr > >> (or > >>>>> tags), which may have its own performance issues as explained by > >>> Stephen > >>>>> Wu, but is something that could be attempted. > >>>>> > >>>>> Em qua., 19 de mai. de 2021 às 04:56, 张铎(Duo Zhang) < > >>>> [email protected] > >>>>>> > >>>>> escreveu: > >>>>> > >>>>>> S3 now supports strong consistency, and I heard that they are also > >>>>>> implementing atomic renaming currently, so maybe that's one of the > >>>>> reasons > >>>>>> why the development is silent now... > >>>>>> > >>>>>> For me, I also think deploying hbase on cloud storage is the > >> future, > >>>> so I > >>>>>> would also like to participate here. > >>>>>> > >>>>>> But I do not think store hfile list in meta is the only solution. > >> It > >>>> will > >>>>>> cause cyclic dependencies for hbase:meta, and then force us a have > >> a > >>>>>> fallback solution which makes the code a bit ugly. We should try to > >>> see > >>>>> if > >>>>>> this could be done with only the FileSystem. > >>>>>> > >>>>>> Thanks. > >>>>>> > >>>>>> Andrew Purtell <[email protected]> 于2021年5月19日周三 上午8:04写道: > >>>>>> > >>>>>>> Wellington (and et. al), > >>>>>>> > >>>>>>> S3 is also an important piece of our future production plans. > >>>>>>> Unfortunately, we were unable to assist much with last year's > >>> work, > >>>> on > >>>>>>> account of being sidetracked by more immediate concerns. > >>> Fortunately, > >>>>>> this > >>>>>>> renewed interest is timely in that we have an HBase 2 project > >>> where, > >>>> if > >>>>>>> this can land in a 2.5 or a 2.6, it could be an important cost to > >>>> serve > >>>>>>> optimization, and one we could and would make use of. Therefore I > >>>> would > >>>>>>> like to restate my employer's interest in this work too. It may > >>> just > >>>> be > >>>>>>> Viraj and myself in the early days. > >>>>>>> > >>>>>>> I'm not sure how best to collaborate. We could review changes > >> from > >>>> the > >>>>>>> original authors, new changes, and/or divide up the development > >>>> tasks. > >>>>> We > >>>>>>> can certainly offer our time for testing, and can afford the > >> costs > >>> of > >>>>>>> testing against the S3 service. > >>>>>>> > >>>>>>> > >>>>>>> On Tue, May 18, 2021 at 12:16 PM Wellington Chevreuil < > >>>>>>> [email protected]> wrote: > >>>>>>> > >>>>>>>> Greetings everyone, > >>>>>>>> > >>>>>>>> HBASE-24749 has been proposed almost a year ago, introducing a > >>> new > >>>>>>>> StoreFile tracker as a way to allow for any hbase hfile > >>>> modifications > >>>>>> to > >>>>>>> be > >>>>>>>> safely completed without needing a file system rename. This > >> seems > >>>>>> pretty > >>>>>>>> relevant for deployments over S3 file systems, where rename > >>>>> operations > >>>>>>> are > >>>>>>>> not atomic and can have a performance degradation when multiple > >>>>>> requests > >>>>>>>> get concurrently submitted to the same bucket. We had done > >>>>> superficial > >>>>>>>> tests and ycsb runs, where individual renames of files larger > >>> than > >>>>> 5GB > >>>>>>> can > >>>>>>>> take a few hundreds of seconds to complete. We also observed > >>>> impacts > >>>>> in > >>>>>>>> write loads throughput, the bottleneck potentially being the > >>>> renames. > >>>>>>>> > >>>>>>>> With S3 being an important piece of my employer cloud solution, > >>> we > >>>>>> would > >>>>>>>> like to help it move forward. 
We plan to contribute new patches > >>> per > >>>>> the > >>>>>>>> original design/Jira, but we’d also be happy to review changes > >>> from > >>>>> the > >>>>>>>> original authors, too. Please let us know if anyone has any > >>>> concerns, > >>>>>>>> otherwise we’ll start to self-assign issues on HBASE-24749 > >>>>>>>> > >>>>>>>> Wellington > >>>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> -- > >>>>>>> Best regards, > >>>>>>> Andrew > >>>>>>> > >>>>>>> Words like orphans lost among the crosstalk, meaning torn from > >>>> truth's > >>>>>>> decrepit hands > >>>>>>> - A23, Crosstalk > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> >
