> On May 20, 2021, at 4:00 AM, Wellington Chevreuil 
> <[email protected]> wrote:
> 
> 
>> 
>> 
>> IMO it should be a file per store.
>> Per region is not suitable here as compaction is per store.
>> Per file means we still need to list all the files. And usually, after
>> compaction, we need to do an atomic operation to remove several old files
>> and add a new file, or even several files for stripe compaction. It will be
>> easy if we just write one file to commit these changes.
>> 
> 
> Fine for me if it's simpler. Mentioned the per file approach because I
> thought it could be easier/faster to do that, rather than having to update
> the store file list on every flush. AFAIK, append is off the table, so
> updating this file would mean reading it, writing the original content plus
> the new hfile to a temp file, deleting the original file, then renaming the
> temp file.
> 

That sounds right to me. 

A minor potential optimization: the filename could have a timestamp 
component, so a bucket index listing at that path would pick up a list 
including the latest, and the latest would be used as the manifest of valid 
store files. The cloud object store is expected to provide atomic listing 
semantics, where the file is written and closed and only then becomes 
visible, and it becomes visible to everyone at once. (I think this is 
available on most.) Old manifest file versions could be lazily deleted. 
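To make that concrete, here is a minimal sketch of the timestamped-manifest 
scheme, assuming only atomic PUT and a consistent LIST from the object 
store. The BlobStore class is an in-memory stand-in for the bucket, and 
every name here is illustrative rather than an HBase or vendor API:

```python
class BlobStore:
    """Toy stand-in for a strongly consistent object store."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        # Atomic: the object becomes visible to everyone only once written.
        self.objects[key] = data

    def list(self, prefix):
        return sorted(k for k in self.objects if k.startswith(prefix))

    def get(self, key):
        return self.objects[key]

    def delete(self, key):
        self.objects.pop(key, None)


def commit_manifest(store, prefix, hfiles, seq):
    """Commit a store-file list as one brand-new object; no append, no rename."""
    # Zero-padded sequence so lexical order matches numeric order.
    key = f"{prefix}/manifest.{seq:020d}"
    store.put(key, "\n".join(sorted(hfiles)))
    return key


def current_manifest(store, prefix):
    """The lexicographically greatest manifest names the valid store files."""
    keys = store.list(f"{prefix}/manifest.")
    return keys[-1] if keys else None


def cleanup_old(store, prefix):
    """Lazily delete superseded manifest versions."""
    for key in store.list(f"{prefix}/manifest.")[:-1]:
        store.delete(key)
```

After a compaction replaces f1 and f2 with f3, the commit is a single new 
object; readers listing the prefix observe the new manifest immediately, and 
superseded versions can be garbage-collected at leisure.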


>> On Thu, May 20, 2021 at 02:57, 张铎(Duo Zhang) <[email protected]>
>> wrote:
>> 
>> IIRC, S3 was the only object storage which did not guarantee
>> read-after-write consistency in the past...
>> 
>> This is the quick result after googling
>> 
>> AWS [1]
>> 
>>> Amazon S3 delivers strong read-after-write consistency automatically for
>>> all applications
>> 
>> 
>> Azure[2]
>> 
>>> Azure Storage was designed to embrace a strong consistency model that
>>> guarantees that after the service performs an insert or update operation,
>>> subsequent read operations return the latest update.
>> 
>> 
>> Aliyun[3]
>> 
>>> A feature requires that object operations in OSS be atomic, which
>>> indicates that operations can only either succeed or fail without
>>> intermediate states. To ensure that users can access only complete data,
>>> OSS does not return corrupted or partial data.
>>> 
>>> Object operations in OSS are highly consistent. For example, when a user
>>> receives an upload (PUT) success response, the uploaded object can be read
>>> immediately, and copies of the object are written to multiple devices for
>>> redundancy. Therefore, the situations where data is not obtained when you
>>> perform the read-after-write operation do not exist. The same is true for
>>> delete operations. After you delete an object, the object and its copies no
>>> longer exist.
>>> 
>> 
>> GCP[4]
>> 
>>> Cloud Storage provides strong global consistency for the following
>>> operations, including both data and metadata:
>>> 
>>> Read-after-write
>>> Read-after-metadata-update
>>> Read-after-delete
>>> Bucket listing
>>> Object listing
>>> 
>> 
>> I think these vendors could cover most end users in the world?
>> 
>> 1. https://aws.amazon.com/cn/s3/consistency/
>> 2. https://docs.microsoft.com/en-us/azure/storage/blobs/concurrency-manage?tabs=dotnet
>> 3. https://www.alibabacloud.com/help/doc-detail/31827.htm
>> 4. https://cloud.google.com/storage/docs/consistency
>> 
>> Nick Dimiduk <[email protected]> wrote on Wed, May 19, 2021 at 11:40 PM:
>> 
>>> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) <[email protected]>
>>> wrote:
>>> 
>>>> What about just storing the hfile list in a file? Since now S3 has strong
>>>> consistency, we could safely overwrite a file then I think?
>>>> 
>>> 
>>> My concern is about portability. S3 isn't the only blob store in town, and
>>> consistent read-what-you-wrote semantics are not a standard feature, as far
>>> as I know. If we want something that can work on 3 or 5 major public cloud
>>> blobstore products as well as a smattering of on-prem technologies, we
>>> should be selective about what features we choose to rely on as
>>> foundational to our implementation.
>>> 
>>> Or we are explicitly saying this will only work on S3 and we'll only
>>> support other services when they can achieve this level of compatibility.
>>> 
>>> Either way, we should be clear and up-front about what semantics we demand.
>>> Implementing some kind of a test harness that can check compatibility would
>>> help here, a similar effort to that of defining standard behaviors of HDFS
>>> implementations.
>>> 
>>> I love this discussion :)
>>> 
>>>> And since the hfile list file will be very small, renaming will not be a
>>>> big problem.
>>>> 
>>> 
>>> Would this be a file per store? A file per region? Ah. Below you imply it's
>>> per store.
>>> 
>>>> Wellington Chevreuil <[email protected]> wrote on Wed, May 19,
>>>> 2021 at 10:43 PM:
>>>> 
>>>>> Thank you, Andrew and Duo,
>>>>> 
>>>>> Talking internally with Josh Elser, the initial idea was to rebase the
>>>>> feature branch on master (in order to catch up with the latest commits),
>>>>> then focus on work to have a minimal functioning hbase; in other words,
>>>>> together with the already committed work from HBASE-25391, make sure
>>>>> flush, compactions, splits and merges can all take advantage of the
>>>>> persistent store file manager and complete with no need to rely on
>>>>> renames. These all map to the subtasks HBASE-25391, HBASE-25392 and
>>>>> HBASE-25393. Once we can test and validate this works well for our
>>>>> goals, we can then focus on snapshots, bulkloading and tooling.
>>>>> 
>>>>>> S3 now supports strong consistency, and I heard that they are also
>>>>>> implementing atomic renaming currently, so maybe that's one of the
>>>>>> reasons why the development is silent now...
>>>>>> 
>>>>> Interesting, I had no idea this was being implemented. I know, however, a
>>>>> version of this feature is already available on the latest EMR releases
>>>>> (at least from 6.2.0), and the AWS team has published their own blog post
>>>>> with their results:
>>>>> 
>>>>> https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
>>>>> 
>>>>>> But I do not think store hfile list in meta is the only solution. It
>>>>>> will cause cyclic dependencies for hbase:meta, and then force us to have
>>>>>> a fallback solution which makes the code a bit ugly. We should try to
>>>>>> see if this could be done with only the FileSystem.
>>>>>> 
>>>>> This is indeed a relevant concern. One idea I had mentioned in the
>>>>> original design doc was to track committed/non-committed files through
>>>>> xattr (or tags), which may have its own performance issues as explained
>>>>> by Stephen Wu, but is something that could be attempted.
>>>>> 
>>>>> On Wed, May 19, 2021 at 04:56, 张铎(Duo Zhang) <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> S3 now supports strong consistency, and I heard that they are also
>>>>>> implementing atomic renaming currently, so maybe that's one of the
>>>>>> reasons why the development is silent now...
>>>>>> 
>>>>>> For me, I also think deploying hbase on cloud storage is the future, so
>>>>>> I would also like to participate here.
>>>>>> 
>>>>>> But I do not think store hfile list in meta is the only solution. It
>>>>>> will cause cyclic dependencies for hbase:meta, and then force us to have
>>>>>> a fallback solution which makes the code a bit ugly. We should try to
>>>>>> see if this could be done with only the FileSystem.
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>>> Andrew Purtell <[email protected]> wrote on Wed, May 19, 2021 at 8:04 AM:
>>>>>> 
>>>>>>> Wellington (et al.),
>>>>>>> 
>>>>>>> S3 is also an important piece of our future production plans.
>>>>>>> Unfortunately, we were unable to assist much with last year's work, on
>>>>>>> account of being sidetracked by more immediate concerns. Fortunately,
>>>>>>> this renewed interest is timely in that we have an HBase 2 project
>>>>>>> where, if this can land in a 2.5 or a 2.6, it could be an important
>>>>>>> cost-to-serve optimization, and one we could and would make use of.
>>>>>>> Therefore I would like to restate my employer's interest in this work
>>>>>>> too. It may just be Viraj and myself in the early days.
>>>>>>> 
>>>>>>> I'm not sure how best to collaborate. We could review changes from the
>>>>>>> original authors, new changes, and/or divide up the development tasks.
>>>>>>> We can certainly offer our time for testing, and can afford the costs
>>>>>>> of testing against the S3 service.
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, May 18, 2021 at 12:16 PM Wellington Chevreuil <
>>>>>>> [email protected]> wrote:
>>>>>>> 
>>>>>>>> Greetings everyone,
>>>>>>>> 
>>>>>>>> HBASE-24749 was proposed almost a year ago, introducing a new
>>>>>>>> StoreFile tracker as a way to allow any hbase hfile modifications to
>>>>>>>> be safely completed without needing a file system rename. This seems
>>>>>>>> pretty relevant for deployments over S3 file systems, where rename
>>>>>>>> operations are not atomic and can suffer performance degradation when
>>>>>>>> multiple requests get concurrently submitted to the same bucket. We
>>>>>>>> had done superficial tests and ycsb runs, where individual renames of
>>>>>>>> files larger than 5GB can take a few hundred seconds to complete. We
>>>>>>>> also observed impacts on write load throughput, the bottleneck
>>>>>>>> potentially being the renames.
>>>>>>>> 
>>>>>>>> With S3 being an important piece of my employer's cloud solution, we
>>>>>>>> would like to help it move forward. We plan to contribute new patches
>>>>>>>> per the original design/Jira, but we'd also be happy to review changes
>>>>>>>> from the original authors, too. Please let us know if anyone has any
>>>>>>>> concerns; otherwise we'll start to self-assign issues on HBASE-24749.
>>>>>>>> 
>>>>>>>> Wellington
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Best regards,
>>>>>>> Andrew
>>>>>>> 
>>>>>>> Words like orphans lost among the crosstalk, meaning torn from
>>>>>>> truth's decrepit hands
>>>>>>>   - A23, Crosstalk
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
