> On May 21, 2021, at 6:07 PM, 张铎 <[email protected]> wrote:
>
> Since we just make use of the general FileSystem API to do listing, is it possible to make use of 'bucket index listing'?
Yes, those words mean the same thing.

> On Sat, May 22, 2021 at 6:34 AM, Andrew Purtell <[email protected]> wrote:
>>
>>> On May 20, 2021, at 4:00 AM, Wellington Chevreuil <[email protected]> wrote:
>>>
>>>> IMO it should be a file per store.
>>>> Per region is not suitable here as compaction is per store.
>>>> Per file means we still need to list all the files. And usually, after compaction, we need to do an atomic operation to remove several old files and add a new file, or even several files for stripe compaction. It will be easy if we just write one file to commit these changes.
>>>
>>> Fine for me if it's simpler. Mentioned the per file approach because I thought it could be easier/faster to do that, rather than having to update the store file list on every flush. AFAIK, append is off the table, so updating this file would mean reading it, writing the original content plus the new hfile to a temp file, deleting the original file, and renaming it.
>>>
>>
>> That sounds right to me.
>>
>> A minor potential optimization is that the filename could have a timestamp component, so a bucket index listing at that path would pick up a list including the latest, and the latest would be used as the manifest of valid store files. The cloud object store is expected to provide an atomic listing semantic where the file is written and closed and only then is it visible, and it is visible at once to everyone. (I think this is available on most.) Old manifest file versions could be lazily deleted.
>>
>>>> On Thu, May 20, 2021 at 02:57, 张铎 (Duo Zhang) <[email protected]> wrote:
>>>>
>>>> IIRC S3 is the only object storage which did not guarantee read-after-write consistency in the past...
>>>>
>>>> This is the quick result after googling:
>>>>
>>>> AWS [1]
>>>>
>>>>> Amazon S3 delivers strong read-after-write consistency automatically for all applications
>>>>
>>>> Azure [2]
>>>>
>>>>> Azure Storage was designed to embrace a strong consistency model that guarantees that after the service performs an insert or update operation, subsequent read operations return the latest update.
>>>>
>>>> Aliyun [3]
>>>>
>>>>> A feature requires that object operations in OSS be atomic, which indicates that operations can only either succeed or fail without intermediate states. To ensure that users can access only complete data, OSS does not return corrupted or partial data.
>>>>>
>>>>> Object operations in OSS are highly consistent. For example, when a user receives an upload (PUT) success response, the uploaded object can be read immediately, and copies of the object are written to multiple devices for redundancy. Therefore, the situations where data is not obtained when you perform the read-after-write operation do not exist. The same is true for delete operations. After you delete an object, the object and its copies no longer exist.
>>>>
>>>> GCP [4]
>>>>
>>>>> Cloud Storage provides strong global consistency for the following operations, including both data and metadata:
>>>>>
>>>>> Read-after-write
>>>>> Read-after-metadata-update
>>>>> Read-after-delete
>>>>> Bucket listing
>>>>> Object listing
>>>>
>>>> I think these vendors could cover most end users in the world?
>>>>
>>>> 1. https://aws.amazon.com/cn/s3/consistency/
>>>> 2.
>>>> https://docs.microsoft.com/en-us/azure/storage/blobs/concurrency-manage?tabs=dotnet
>>>> 3. https://www.alibabacloud.com/help/doc-detail/31827.htm
>>>> 4. https://cloud.google.com/storage/docs/consistency
>>>>
>>>> On Wed, May 19, 2021 at 11:40 PM, Nick Dimiduk <[email protected]> wrote:
>>>>
>>>>> On Wed, May 19, 2021 at 8:19 AM 张铎 (Duo Zhang) <[email protected]> wrote:
>>>>>
>>>>>> What about just storing the hfile list in a file? Since S3 now has strong consistency, we could safely overwrite a file then, I think?
>>>>>
>>>>> My concern is about portability. S3 isn't the only blob store in town, and consistent read-what-you-wrote semantics are not a standard feature, as far as I know. If we want something that can work on 3 or 5 major public cloud blobstore products as well as a smattering of on-prem technologies, we should be selective about what features we choose to rely on as foundational to our implementation.
>>>>>
>>>>> Or we are explicitly saying this will only work on S3 and we'll only support other services when they can achieve this level of compatibility.
>>>>>
>>>>> Either way, we should be clear and up-front about what semantics we demand. Implementing some kind of a test harness that can check compatibility would help here, a similar effort to that of defining standard behaviors of HDFS implementations.
>>>>>
>>>>> I love this discussion :)
>>>>>
>>>>>> And since the hfile list file will be very small, renaming will not be a big problem.
>>>>>
>>>>> Would this be a file per store? A file per region? Ah. Below you imply it's per store.
>>>>>
>>>>>> On Wed, May 19, 2021 at 10:43 PM, Wellington Chevreuil <[email protected]> wrote:
>>>>>>
>>>>>>> Thank you, Andrew and Duo,
>>>>>>>
>>>>>>> Talking internally with Josh Elser, the initial idea was to rebase the feature branch on master (in order to catch up with the latest commits), then focus on work to have a minimal functioning hbase; in other words, together with the already committed work from HBASE-25391, make sure flush, compactions, splits and merges all can take advantage of the persistent store file manager and complete with no need to rely on renames. These all map to the subtasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we could test and validate this works well for our goals, we can then focus on snapshots, bulkloading and tooling.
>>>>>>>
>>>>>>>> S3 now supports strong consistency, and I heard that they are also implementing atomic renaming currently, so maybe that's one of the reasons why the development is silent now...
>>>>>>>
>>>>>>> Interesting, I had no idea this was being implemented. I know, however, a version of this feature is already available on the latest EMR releases (at least from 6.2.0), and the AWS team has published their own blog post with their results:
>>>>>>>
>>>>>>> https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
>>>>>>>
>>>>>>>> But I do not think storing the hfile list in meta is the only solution. It will cause cyclic dependencies for hbase:meta, and then force us to have a fallback solution, which makes the code a bit ugly.
>>>>>>>> We should try to see if this could be done with only the FileSystem.
>>>>>>>
>>>>>>> This is indeed a relevant concern. One idea I had mentioned in the original design doc was to track committed/non-committed files through xattr (or tags), which may have its own performance issues, as explained by Stephen Wu, but is something that could be attempted.
>>>>>>>
>>>>>>> On Wed, May 19, 2021 at 04:56, 张铎 (Duo Zhang) <[email protected]> wrote:
>>>>>>>
>>>>>>>> S3 now supports strong consistency, and I heard that they are also implementing atomic renaming currently, so maybe that's one of the reasons why the development is silent now...
>>>>>>>>
>>>>>>>> For me, I also think deploying hbase on cloud storage is the future, so I would also like to participate here.
>>>>>>>>
>>>>>>>> But I do not think storing the hfile list in meta is the only solution. It will cause cyclic dependencies for hbase:meta, and then force us to have a fallback solution, which makes the code a bit ugly. We should try to see if this could be done with only the FileSystem.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> On Wed, May 19, 2021 at 8:04 AM, Andrew Purtell <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Wellington (et al.),
>>>>>>>>>
>>>>>>>>> S3 is also an important piece of our future production plans. Unfortunately, we were unable to assist much with last year's work, on account of being sidetracked by more immediate concerns. Fortunately, this renewed interest is timely in that we have an HBase 2 project where, if this can land in a 2.5 or a 2.6, it could be an important cost-to-serve optimization, and one we could and would make use of. Therefore I would like to restate my employer's interest in this work too. It may just be Viraj and myself in the early days.
>>>>>>>>>
>>>>>>>>> I'm not sure how best to collaborate. We could review changes from the original authors, review new changes, and/or divide up the development tasks. We can certainly offer our time for testing, and can afford the costs of testing against the S3 service.
>>>>>>>>>
>>>>>>>>> On Tue, May 18, 2021 at 12:16 PM Wellington Chevreuil <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Greetings everyone,
>>>>>>>>>>
>>>>>>>>>> HBASE-24749 was proposed almost a year ago, introducing a new StoreFile tracker as a way to allow any hbase hfile modifications to be safely completed without needing a file system rename. This seems pretty relevant for deployments over S3 file systems, where rename operations are not atomic and can suffer performance degradation when multiple requests get concurrently submitted to the same bucket. We had done superficial tests and ycsb runs, where individual renames of files larger than 5GB can take a few hundred seconds to complete. We also observed impacts on write load throughput, the bottleneck potentially being the renames.
>>>>>>>>>>
>>>>>>>>>> With S3 being an important piece of my employer's cloud solution, we would like to help it move forward. We plan to contribute new patches per the original design/Jira, but we'd also be happy to review changes from the original authors. Please let us know if anyone has any concerns; otherwise we'll start to self-assign issues on HBASE-24749.
>>>>>>>>>>
>>>>>>>>>> Wellington
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best regards,
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>>> Words like orphans lost among the crosstalk, meaning torn from truth's decrepit hands
>>>>>>>>>    - A23, Crosstalk
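To make the idea concrete, here is a minimal sketch (in Java, against the Hadoop FileSystem API) of the timestamp-suffixed manifest approach discussed in the quoted thread above: every flush or compaction commits by writing a brand new manifest object, readers pick the newest one from a single directory listing, and superseded versions are deleted lazily. The class name, the "filelist." prefix, and the plain-text format are illustrative assumptions only, not part of HBASE-24749; real code would also need a proper serialization format, handling for concurrent writers, and clock-skew safeguards.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /**
     * Sketch only: a per-store manifest of committed hfiles. Each commit writes a
     * brand new object whose name carries a timestamp; readers list the manifest
     * directory (assumed to exist) and treat the newest entry as the authoritative
     * store file list. Only PUT, LIST and lazy DELETE are needed -- no rename, no append.
     */
    public class StoreFileManifestSketch {

      // Hypothetical naming convention, not from HBASE-24749.
      private static final String MANIFEST_PREFIX = "filelist.";

      /** Commit a new store file list by writing a fresh, timestamped manifest object. */
      public static Path commit(FileSystem fs, Path manifestDir, List<String> hfileNames)
          throws IOException {
        Path manifest = new Path(manifestDir, MANIFEST_PREFIX + System.currentTimeMillis());
        try (FSDataOutputStream out = fs.create(manifest, false /* never overwrite */)) {
          for (String name : hfileNames) {
            out.write((name + "\n").getBytes(StandardCharsets.UTF_8));
          }
        }
        return manifest;
      }

      /** Pick the newest manifest from a single directory listing. */
      public static Path newestManifest(FileSystem fs, Path manifestDir) throws IOException {
        Path newest = null;
        long newestTs = -1L;
        for (FileStatus status : fs.listStatus(manifestDir)) {
          String name = status.getPath().getName();
          if (!name.startsWith(MANIFEST_PREFIX)) {
            continue;
          }
          long ts = Long.parseLong(name.substring(MANIFEST_PREFIX.length()));
          if (ts > newestTs) {
            newestTs = ts;
            newest = status.getPath();
          }
        }
        return newest; // null means nothing has been committed yet
      }

      /** Lazily delete superseded manifest versions, keeping only the newest. */
      public static void cleanOldManifests(FileSystem fs, Path manifestDir) throws IOException {
        Path newest = newestManifest(fs, manifestDir);
        for (FileStatus status : fs.listStatus(manifestDir)) {
          String name = status.getPath().getName();
          if (name.startsWith(MANIFEST_PREFIX) && !status.getPath().equals(newest)) {
            fs.delete(status.getPath(), false);
          }
        }
      }
    }

On s3a:// the listStatus() call above is served by a bucket LIST request, i.e. the "bucket index listing" asked about at the top of the thread, and given the list-after-write consistency the vendors quote above, the newest manifest becomes visible to every reader as soon as the writer closes it.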

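And here is a similarly rough sketch of the alternative that Wellington and Duo discuss: keep one small hfile-list file per store and, since append is off the table, update it by reading the current content, writing it plus the new hfile name to a temp file, and swapping the temp file into place. Because the list file is tiny, the final rename is cheap even where rename is copy plus delete. The class name and the ".tmp" suffix are made up for the example, and the window between the delete and the rename is exactly the kind of edge case a real implementation would have to detect and repair.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /**
     * Sketch only: the "no append" update cycle for a single, small hfile-list file.
     * Read the current list, write the old content plus the new hfile name to a temp
     * file, then swap the temp file into place.
     */
    public class StoreFileListUpdateSketch {

      public static void addHFile(FileSystem fs, Path listFile, String newHFile) throws IOException {
        // 1. Read the current list, if the file already exists.
        List<String> names = new ArrayList<>();
        if (fs.exists(listFile)) {
          try (BufferedReader reader = new BufferedReader(
              new InputStreamReader(fs.open(listFile), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
              names.add(line);
            }
          }
        }
        names.add(newHFile);

        // 2. Write the old content plus the new hfile name to a temp file.
        Path tmp = new Path(listFile.getParent(), listFile.getName() + ".tmp");
        try (FSDataOutputStream out = fs.create(tmp, true /* replace a stale temp file */)) {
          for (String name : names) {
            out.write((name + "\n").getBytes(StandardCharsets.UTF_8));
          }
        }

        // 3. Swap the temp file into place: delete the original, then rename the temp file.
        //    A crash between these two calls leaves only the temp file behind, which a real
        //    implementation would need to recover from when the store is next opened.
        fs.delete(listFile, false);
        if (!fs.rename(tmp, listFile)) {
          throw new IOException("Failed to rename " + tmp + " to " + listFile);
        }
      }
    }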