> On May 21, 2021, at 6:07 PM, 张铎 <[email protected]> wrote:
>
> Since we just make use of the general FileSystem API to do listing, is it possible to make use of 'bucket index listing'?
Yes, those words mean the same thing.

> On Sat, May 22, 2021 at 6:34 AM, Andrew Purtell <[email protected]> wrote:
>>
>>> On May 20, 2021, at 4:00 AM, Wellington Chevreuil <[email protected]> wrote:
>>>
>>>> IMO it should be a file per store.
>>>> Per region is not suitable here as compaction is per store.
>>>> Per file means we still need to list all the files. And usually, after compaction, we need to do an atomic operation to remove several old files and add a new file, or even several files for stripe compaction. It will be easy if we just write one file to commit these changes.
>>>
>>> Fine for me if it's simpler. Mentioned the per file approach because I thought it could be easier/faster to do that, rather than having to update the store file list on every flush. AFAIK, append is off the table, so updating this file would mean reading it, writing the original content plus the new hfile to a temp file, deleting the original file, and renaming it.
>>>
>>
>> That sounds right to me.
>>
>> A minor potential optimization is that the filename could have a timestamp component, so a bucket index listing at that path would pick up a list including the latest, and the latest would be used as the manifest of valid store files. The cloud object store is expected to provide an atomic listing semantic where the file is written and closed and only then is it visible, and it is visible at once to everyone. (I think this is available on most.) Old manifest file versions could be lazily deleted.
>>
>>>> On Thu, May 20, 2021 at 02:57, 张铎 (Duo Zhang) <[email protected]> wrote:
>>>>
>>>> IIRC S3 is the only object storage which did not guarantee read-after-write consistency in the past...
>>>>
>>>> This is the quick result after googling:
>>>>
>>>> AWS [1]
>>>>
>>>>> Amazon S3 delivers strong read-after-write consistency automatically for all applications
>>>>
>>>> Azure [2]
>>>>
>>>>> Azure Storage was designed to embrace a strong consistency model that guarantees that after the service performs an insert or update operation, subsequent read operations return the latest update.
>>>>
>>>> Aliyun [3]
>>>>
>>>>> A feature requires that object operations in OSS be atomic, which indicates that operations can only either succeed or fail without intermediate states. To ensure that users can access only complete data, OSS does not return corrupted or partial data.
>>>>>
>>>>> Object operations in OSS are highly consistent. For example, when a user receives an upload (PUT) success response, the uploaded object can be read immediately, and copies of the object are written to multiple devices for redundancy. Therefore, the situations where data is not obtained when you perform the read-after-write operation do not exist. The same is true for delete operations. After you delete an object, the object and its copies no longer exist.
>>>>
>>>> GCP [4]
>>>>
>>>>> Cloud Storage provides strong global consistency for the following operations, including both data and metadata:
>>>>>
>>>>> Read-after-write
>>>>> Read-after-metadata-update
>>>>> Read-after-delete
>>>>> Bucket listing
>>>>> Object listing
>>>>
>>>> I think these vendors could cover most end users in the world?
>>>>
>>>> 1. https://aws.amazon.com/cn/s3/consistency/
>>>> 2.
>>>> https://docs.microsoft.com/en-us/azure/storage/blobs/concurrency-manage?tabs=dotnet
>>>> 3. https://www.alibabacloud.com/help/doc-detail/31827.htm
>>>> 4. https://cloud.google.com/storage/docs/consistency
>>>>
>>>> On Wed, May 19, 2021 at 11:40 PM, Nick Dimiduk <[email protected]> wrote:
>>>>
>>>>> On Wed, May 19, 2021 at 8:19 AM 张铎 (Duo Zhang) <[email protected]> wrote:
>>>>>
>>>>>> What about just storing the hfile list in a file? Since S3 now has strong consistency, we could safely overwrite a file then, I think?
>>>>>
>>>>> My concern is about portability. S3 isn't the only blob store in town, and consistent read-what-you-wrote semantics are not a standard feature, as far as I know. If we want something that can work on 3 or 5 major public cloud blobstore products as well as a smattering of on-prem technologies, we should be selective about what features we choose to rely on as foundational to our implementation.
>>>>>
>>>>> Or we are explicitly saying this will only work on S3 and we'll only support other services when they can achieve this level of compatibility.
>>>>>
>>>>> Either way, we should be clear and up-front about what semantics we demand. Implementing some kind of a test harness that can check compatibility would help here, a similar effort to that of defining standard behaviors of HDFS implementations.
>>>>>
>>>>> I love this discussion :)
>>>>>
>>>>>> And since the hfile list file will be very small, renaming will not be a big problem.
>>>>>
>>>>> Would this be a file per store? A file per region? Ah. Below you imply it's per store.
>>>>>
>>>>>> On Wed, May 19, 2021 at 10:43 PM, Wellington Chevreuil <[email protected]> wrote:
>>>>>>
>>>>>>> Thank you, Andrew and Duo,
>>>>>>>
>>>>>>> Talking internally with Josh Elser, the initial idea was to rebase the feature branch on master (in order to catch up with the latest commits), then focus on work to have a minimal functioning hbase; in other words, together with the already committed work from HBASE-25391, make sure flush, compactions, splits and merges all can take advantage of the persistent store file manager and complete with no need to rely on renames. These all map to the subtasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we could test and validate this works well for our goals, we can then focus on snapshots, bulkloading and tooling.
>>>>>>>
>>>>>>>> S3 now supports strong consistency, and I heard that they are also implementing atomic renaming currently, so maybe that's one of the reasons why the development is silent now...
>>>>>>>
>>>>>>> Interesting, I had no idea this was being implemented. I know, however, a version of this feature is already available on the latest EMR releases (at least from 6.2.0), and the AWS team has published their own blog post with their results:
>>>>>>>
>>>>>>> https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
>>>>>>>
>>>>>>>> But I do not think storing the hfile list in meta is the only solution. It will cause cyclic dependencies for hbase:meta, and then force us to have a fallback solution, which makes the code a bit ugly.
>>>>>>>> We should try to see if this could be done with only the FileSystem.
>>>>>>>
>>>>>>> This is indeed a relevant concern. One idea I had mentioned in the original design doc was to track committed/non-committed files through xattr (or tags), which may have its own performance issues, as explained by Stephen Wu, but is something that could be attempted.
>>>>>>>
>>>>>>> On Wed, May 19, 2021 at 04:56, 张铎 (Duo Zhang) <[email protected]> wrote:
>>>>>>>
>>>>>>>> S3 now supports strong consistency, and I heard that they are also implementing atomic renaming currently, so maybe that's one of the reasons why the development is silent now...
>>>>>>>>
>>>>>>>> For me, I also think deploying hbase on cloud storage is the future, so I would also like to participate here.
>>>>>>>>
>>>>>>>> But I do not think storing the hfile list in meta is the only solution. It will cause cyclic dependencies for hbase:meta, and then force us to have a fallback solution, which makes the code a bit ugly. We should try to see if this could be done with only the FileSystem.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> On Wed, May 19, 2021 at 8:04 AM, Andrew Purtell <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Wellington (et al.),
>>>>>>>>>
>>>>>>>>> S3 is also an important piece of our future production plans. Unfortunately, we were unable to assist much with last year's work, on account of being sidetracked by more immediate concerns. Fortunately, this renewed interest is timely in that we have an HBase 2 project where, if this can land in a 2.5 or a 2.6, it could be an important cost-to-serve optimization, and one we could and would make use of. Therefore I would like to restate my employer's interest in this work too. It may just be Viraj and myself in the early days.
>>>>>>>>>
>>>>>>>>> I'm not sure how best to collaborate. We could review changes from the original authors, review new changes, and/or divide up the development tasks. We can certainly offer our time for testing, and can afford the costs of testing against the S3 service.
>>>>>>>>>
>>>>>>>>> On Tue, May 18, 2021 at 12:16 PM Wellington Chevreuil <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Greetings everyone,
>>>>>>>>>>
>>>>>>>>>> HBASE-24749 was proposed almost a year ago, introducing a new StoreFile tracker as a way to allow any hbase hfile modifications to be safely completed without needing a file system rename. This seems pretty relevant for deployments over S3 file systems, where rename operations are not atomic and can suffer performance degradation when multiple requests get concurrently submitted to the same bucket. We had done superficial tests and ycsb runs, where individual renames of files larger than 5GB can take a few hundred seconds to complete. We also observed impacts on write load throughput, the bottleneck potentially being the renames.
>>>>>>>>>>
>>>>>>>>>> With S3 being an important piece of my employer's cloud solution, we would like to help it move forward. We plan to contribute new patches per the original design/Jira, but we'd also be happy to review changes from the original authors. Please let us know if anyone has any concerns; otherwise we'll start to self-assign issues on HBASE-24749.
>>>>>>>>>>
>>>>>>>>>> Wellington
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best regards,
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>>> Words like orphans lost among the crosstalk, meaning torn from truth's decrepit hands
>>>>>>>>>    - A23, Crosstalk
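To make the idea concrete, here is a minimal sketch (in Java, against the Hadoop FileSystem API) of the timestamp-suffixed manifest approach discussed in the quoted thread above: every flush or compaction commits by writing a brand new manifest object, readers pick the newest one from a single directory listing, and superseded versions are deleted lazily. The class name, the "filelist." prefix, and the plain-text format are illustrative assumptions only, not part of HBASE-24749; real code would also need a proper serialization format, handling for concurrent writers, and clock-skew safeguards.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /**
     * Sketch only: a per-store manifest of committed hfiles. Each commit writes a
     * brand new object whose name carries a timestamp; readers list the manifest
     * directory (assumed to exist) and treat the newest entry as the authoritative
     * store file list. Only PUT, LIST and lazy DELETE are needed -- no rename, no append.
     */
    public class StoreFileManifestSketch {

      // Hypothetical naming convention, not from HBASE-24749.
      private static final String MANIFEST_PREFIX = "filelist.";

      /** Commit a new store file list by writing a fresh, timestamped manifest object. */
      public static Path commit(FileSystem fs, Path manifestDir, List<String> hfileNames)
          throws IOException {
        Path manifest = new Path(manifestDir, MANIFEST_PREFIX + System.currentTimeMillis());
        try (FSDataOutputStream out = fs.create(manifest, false /* never overwrite */)) {
          for (String name : hfileNames) {
            out.write((name + "\n").getBytes(StandardCharsets.UTF_8));
          }
        }
        return manifest;
      }

      /** Pick the newest manifest from a single directory listing. */
      public static Path newestManifest(FileSystem fs, Path manifestDir) throws IOException {
        Path newest = null;
        long newestTs = -1L;
        for (FileStatus status : fs.listStatus(manifestDir)) {
          String name = status.getPath().getName();
          if (!name.startsWith(MANIFEST_PREFIX)) {
            continue;
          }
          long ts = Long.parseLong(name.substring(MANIFEST_PREFIX.length()));
          if (ts > newestTs) {
            newestTs = ts;
            newest = status.getPath();
          }
        }
        return newest; // null means nothing has been committed yet
      }

      /** Lazily delete superseded manifest versions, keeping only the newest. */
      public static void cleanOldManifests(FileSystem fs, Path manifestDir) throws IOException {
        Path newest = newestManifest(fs, manifestDir);
        for (FileStatus status : fs.listStatus(manifestDir)) {
          String name = status.getPath().getName();
          if (name.startsWith(MANIFEST_PREFIX) && !status.getPath().equals(newest)) {
            fs.delete(status.getPath(), false);
          }
        }
      }
    }

On s3a:// the listStatus() call above is served by a bucket LIST request, i.e. the "bucket index listing" asked about at the top of the thread, and given the list-after-write consistency the vendors quote above, the newest manifest becomes visible to every reader as soon as the writer closes it.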

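And here is a similarly rough sketch of the alternative that Wellington and Duo discuss: keep one small hfile-list file per store and, since append is off the table, update it by reading the current content, writing it plus the new hfile name to a temp file, and swapping the temp file into place. Because the list file is tiny, the final rename is cheap even where rename is copy plus delete. The class name and the ".tmp" suffix are made up for the example, and the window between the delete and the rename is exactly the kind of edge case a real implementation would have to detect and repair.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /**
     * Sketch only: the "no append" update cycle for a single, small hfile-list file.
     * Read the current list, write the old content plus the new hfile name to a temp
     * file, then swap the temp file into place.
     */
    public class StoreFileListUpdateSketch {

      public static void addHFile(FileSystem fs, Path listFile, String newHFile) throws IOException {
        // 1. Read the current list, if the file already exists.
        List<String> names = new ArrayList<>();
        if (fs.exists(listFile)) {
          try (BufferedReader reader = new BufferedReader(
              new InputStreamReader(fs.open(listFile), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
              names.add(line);
            }
          }
        }
        names.add(newHFile);

        // 2. Write the old content plus the new hfile name to a temp file.
        Path tmp = new Path(listFile.getParent(), listFile.getName() + ".tmp");
        try (FSDataOutputStream out = fs.create(tmp, true /* replace a stale temp file */)) {
          for (String name : names) {
            out.write((name + "\n").getBytes(StandardCharsets.UTF_8));
          }
        }

        // 3. Swap the temp file into place: delete the original, then rename the temp file.
        //    A crash between these two calls leaves only the temp file behind, which a real
        //    implementation would need to recover from when the store is next opened.
        fs.delete(listFile, false);
        if (!fs.rename(tmp, listFile)) {
          throw new IOException("Failed to rename " + tmp + " to " + listFile);
        }
      }
    }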