Since we currently just use the general FileSystem API to do listing, is it possible to make use of 'bucket index listing'?
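
To make that concrete, here is a rough sketch of what I have in mind using nothing but the generic Hadoop FileSystem API, building on Andrew's timestamped-manifest suggestion below. To be clear, this is only an illustration for discussion: the manifest.<timestamp> naming, the one-store-file-name-per-line format, and all class/method names are placeholders I made up, not the actual HBASE-24749 design.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch only (names are placeholders): track the committed store files of one
 * store in timestamped manifest files, so that commits never need rename or
 * overwrite, and readers find the current manifest with a plain listing.
 */
public class StoreFileManifestSketch {

  // Hypothetical naming scheme: manifest.<epoch-millis>. Epoch millis are
  // fixed-width for the foreseeable future, so lexicographic order of the
  // names matches chronological order.
  private static final String MANIFEST_PREFIX = "manifest.";

  /** List the store directory and return the newest manifest, or null if none. */
  static Path latestManifest(FileSystem fs, Path storeDir) throws IOException {
    FileStatus[] candidates =
        fs.listStatus(storeDir, p -> p.getName().startsWith(MANIFEST_PREFIX));
    return Arrays.stream(candidates)
        .map(FileStatus::getPath)
        .max(Comparator.comparing(Path::getName))
        .orElse(null);
  }

  /** Read the committed store file names from a manifest (one name per line). */
  static List<String> readManifest(FileSystem fs, Path manifest) throws IOException {
    List<String> files = new ArrayList<>();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(manifest), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        files.add(line);
      }
    }
    return files;
  }

  /**
   * Commit a new list of store files (e.g. after a flush or compaction) by
   * writing a brand new manifest with a newer timestamp. The old manifest is
   * left in place and can be deleted lazily; readers always pick the newest.
   */
  static Path commitManifest(FileSystem fs, Path storeDir, List<String> storeFiles)
      throws IOException {
    Path next = new Path(storeDir, MANIFEST_PREFIX + System.currentTimeMillis());
    // overwrite=false: we never clobber an existing manifest, we only add new ones
    try (FSDataOutputStream out = fs.create(next, false)) {
      for (String storeFile : storeFiles) {
        out.write((storeFile + "\n").getBytes(StandardCharsets.UTF_8));
      }
    }
    return next;
  }
}

The appeal is that readers never depend on rename or overwrite; they only need object listing plus read-after-write visibility of newly created objects (the guarantees quoted from the vendors below), and stale manifests can be cleaned up lazily as Andrew described.
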
Andrew Purtell <[email protected]> 于2021年5月22日周六 上午6:34写道: > > > > On May 20, 2021, at 4:00 AM, Wellington Chevreuil < > [email protected]> wrote: > > > > > >> > >> > >> IMO it should be a file per store. > >> Per region is not suitable here as compaction is per store. > >> Per file means we still need to list all the files. And usually, after > >> compaction, we need to do an atomic operation to remove several old > files > >> and add a new file, or even several files for stripe compaction. It > will be > >> easy if we just write one file to commit these changes. > >> > > > > Fine for me if it's simpler. Mentioned the per file approach because I > > thought it could be easier/faster to do that, rather than having to > update > > the store file list on every flush. AFAIK, append is out of the table, so > > updating this file would mean read it, write original content plus new > > hfile to a temp file, delete original file, rename it). > > > > That sounds right to be. > > A minor potential optimization is the filename could have a timestamp > component, so a bucket index listing at that path would pick up a list > including the latest, and the latest would be used as the manifest of valid > store files. The cloud object store is expected to provide an atomic > listing semantic where the file is written and closed and only then is it > visible, and it is visible at once to everyone. (I think this is available > on most.) Old manifest file versions could be lazily deleted. > > > >> Em qui., 20 de mai. de 2021 às 02:57, 张铎(Duo Zhang) < > [email protected]> > >> escreveu: > >> > >> IIRC S3 is the only object storage which does not guarantee > >> read-after-write consistency in the past... > >> > >> This is the quick result after googling > >> > >> AWS [1] > >> > >>> Amazon S3 delivers strong read-after-write consistency automatically > for > >>> all applications > >> > >> > >> Azure[2] > >> > >>> Azure Storage was designed to embrace a strong consistency model that > >>> guarantees that after the service performs an insert or update > operation, > >>> subsequent read operations return the latest update. > >> > >> > >> Aliyun[3] > >> > >>> A feature requires that object operations in OSS be atomic, which > >>> indicates that operations can only either succeed or fail without > >>> intermediate states. To ensure that users can access only complete > data, > >>> OSS does not return corrupted or partial data. > >>> > >>> Object operations in OSS are highly consistent. For example, when a > user > >>> receives an upload (PUT) success response, the uploaded object can be > >> read > >>> immediately, and copies of the object are written to multiple devices > for > >>> redundancy. Therefore, the situations where data is not obtained when > you > >>> perform the read-after-write operation do not exist. The same is true > for > >>> delete operations. After you delete an object, the object and its > copies > >> no > >>> longer exist. > >>> > >> > >> GCP[4] > >> > >>> Cloud Storage provides strong global consistency for the following > >>> operations, including both data and metadata: > >>> > >>> Read-after-write > >>> Read-after-metadata-update > >>> Read-after-delete > >>> Bucket listing > >>> Object listing > >>> > >> > >> I think these vendors could cover most end users in the world? > >> > >> 1. https://aws.amazon.com/cn/s3/consistency/ > >> 2. > >> > >> > https://docs.microsoft.com/en-us/azure/storage/blobs/concurrency-manage?tabs=dotnet > >> 3. 
https://www.alibabacloud.com/help/doc-detail/31827.htm > >> 4. https://cloud.google.com/storage/docs/consistency > >> > >> Nick Dimiduk <[email protected]> 于2021年5月19日周三 下午11:40写道: > >> > >>> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) <[email protected]> > >>> wrote: > >>> > >>>> What about just storing the hfile list in a file? Since now S3 has > >> strong > >>>> consistency, we could safely overwrite a file then I think? > >>>> > >>> > >>> My concern is about portability. S3 isn't the only blob store in town, > >> and > >>> consistent read-what-you-wrote semantics are not a standard feature, as > >> far > >>> as I know. If we want something that can work on 3 or 5 major public > >> cloud > >>> blobstore products as well as a smattering of on-prem technologies, we > >>> should be selective about what features we choose to rely on as > >>> foundational to our implementation. > >>> > >>> Or we are explicitly saying this will only work on S3 and we'll only > >>> support other services when they can achieve this level of > compatibility. > >>> > >>> Either way, we should be clear and up-front about what semantics we > >> demand. > >>> Implementing some kind of a test harness that can check compatibility > >> would > >>> help here, a similar effort to that of defining standard behaviors of > >> HDFS > >>> implementations. > >>> > >>> I love this discussion :) > >>> > >>> And since the hfile list file will be very small, renaming will not be > a > >>>> big problem. > >>>> > >>> > >>> Would this be a file per store? A file per region? Ah. Below you imply > >> it's > >>> per store. > >>> > >>> Wellington Chevreuil <[email protected]> 于2021年5月19日周三 > >>>> 下午10:43写道: > >>>> > >>>>> Thank you, Andrew and Duo, > >>>>> > >>>>> Talking internally with Josh Elser, initial idea was to rebase the > >>>> feature > >>>>> branch with master (in order to catch with latest commits), then > >> focus > >>> on > >>>>> work to have a minimal functioning hbase, in other words, together > >> with > >>>> the > >>>>> already committed work from HBASE-25391, make sure flush, > >> compactions, > >>>>> splits and merges all can take advantage of the persistent store file > >>>>> manager and complete with no need to rely on renames. These all map > >> to > >>>> the > >>>>> substasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we could > >> test > >>>> and > >>>>> validate this works well for our goals, we can then focus on > >> snapshots, > >>>>> bulkloading and tooling. > >>>>> > >>>>> S3 now supports strong consistency, and I heard that they are also > >>>>>> implementing atomic renaming currently, so maybe that's one of the > >>>>> reasons > >>>>>> why the development is silent now.. > >>>>>> > >>>>> Interesting, I had no idea this was being implemented. I know, > >>> however, a > >>>>> version of this feature is already available on latest EMR releases > >> (at > >>>>> least from 6.2.0), and AWS team has published their own blog post > >> with > >>>>> their results: > >>>>> > >>>>> > >>>> > >>> > >> > https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/ > >>>>> > >>>>> But I do not think store hfile list in meta is the only solution. It > >>> will > >>>>>> cause cyclic dependencies for hbase:meta, and then force us a have > >> a > >>>>>> fallback solution which makes the code a bit ugly. We should try to > >>> see > >>>>> if > >>>>>> this could be done with only the FileSystem. > >>>>>> > >>>>> This is indeed a relevant concern. 
One idea I had mentioned in the > >>>> original > >>>>> design doc was to track committed/non-committed files through xattr > >> (or > >>>>> tags), which may have its own performance issues as explained by > >>> Stephen > >>>>> Wu, but is something that could be attempted. > >>>>> > >>>>> Em qua., 19 de mai. de 2021 às 04:56, 张铎(Duo Zhang) < > >>>> [email protected] > >>>>>> > >>>>> escreveu: > >>>>> > >>>>>> S3 now supports strong consistency, and I heard that they are also > >>>>>> implementing atomic renaming currently, so maybe that's one of the > >>>>> reasons > >>>>>> why the development is silent now... > >>>>>> > >>>>>> For me, I also think deploying hbase on cloud storage is the > >> future, > >>>> so I > >>>>>> would also like to participate here. > >>>>>> > >>>>>> But I do not think store hfile list in meta is the only solution. > >> It > >>>> will > >>>>>> cause cyclic dependencies for hbase:meta, and then force us a have > >> a > >>>>>> fallback solution which makes the code a bit ugly. We should try to > >>> see > >>>>> if > >>>>>> this could be done with only the FileSystem. > >>>>>> > >>>>>> Thanks. > >>>>>> > >>>>>> Andrew Purtell <[email protected]> 于2021年5月19日周三 上午8:04写道: > >>>>>> > >>>>>>> Wellington (and et. al), > >>>>>>> > >>>>>>> S3 is also an important piece of our future production plans. > >>>>>>> Unfortunately, we were unable to assist much with last year's > >>> work, > >>>> on > >>>>>>> account of being sidetracked by more immediate concerns. > >>> Fortunately, > >>>>>> this > >>>>>>> renewed interest is timely in that we have an HBase 2 project > >>> where, > >>>> if > >>>>>>> this can land in a 2.5 or a 2.6, it could be an important cost to > >>>> serve > >>>>>>> optimization, and one we could and would make use of. Therefore I > >>>> would > >>>>>>> like to restate my employer's interest in this work too. It may > >>> just > >>>> be > >>>>>>> Viraj and myself in the early days. > >>>>>>> > >>>>>>> I'm not sure how best to collaborate. We could review changes > >> from > >>>> the > >>>>>>> original authors, new changes, and/or divide up the development > >>>> tasks. > >>>>> We > >>>>>>> can certainly offer our time for testing, and can afford the > >> costs > >>> of > >>>>>>> testing against the S3 service. > >>>>>>> > >>>>>>> > >>>>>>> On Tue, May 18, 2021 at 12:16 PM Wellington Chevreuil < > >>>>>>> [email protected]> wrote: > >>>>>>> > >>>>>>>> Greetings everyone, > >>>>>>>> > >>>>>>>> HBASE-24749 has been proposed almost a year ago, introducing a > >>> new > >>>>>>>> StoreFile tracker as a way to allow for any hbase hfile > >>>> modifications > >>>>>> to > >>>>>>> be > >>>>>>>> safely completed without needing a file system rename. This > >> seems > >>>>>> pretty > >>>>>>>> relevant for deployments over S3 file systems, where rename > >>>>> operations > >>>>>>> are > >>>>>>>> not atomic and can have a performance degradation when multiple > >>>>>> requests > >>>>>>>> get concurrently submitted to the same bucket. We had done > >>>>> superficial > >>>>>>>> tests and ycsb runs, where individual renames of files larger > >>> than > >>>>> 5GB > >>>>>>> can > >>>>>>>> take a few hundreds of seconds to complete. We also observed > >>>> impacts > >>>>> in > >>>>>>>> write loads throughput, the bottleneck potentially being the > >>>> renames. > >>>>>>>> > >>>>>>>> With S3 being an important piece of my employer cloud solution, > >>> we > >>>>>> would > >>>>>>>> like to help it move forward. 
We plan to contribute new patches > >>> per > >>>>> the > >>>>>>>> original design/Jira, but we’d also be happy to review changes > >>> from > >>>>> the > >>>>>>>> original authors, too. Please let us know if anyone has any > >>>> concerns, > >>>>>>>> otherwise we’ll start to self-assign issues on HBASE-24749 > >>>>>>>> > >>>>>>>> Wellington > >>>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> -- > >>>>>>> Best regards, > >>>>>>> Andrew > >>>>>>> > >>>>>>> Words like orphans lost among the crosstalk, meaning torn from > >>>> truth's > >>>>>>> decrepit hands > >>>>>>> - A23, Crosstalk > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> >
