Just go ahead, Josh, I haven't started writing the design doc yet. Thank you for your help!
Josh Elser <[email protected]> 于2021年5月25日周二 上午1:45写道: > Without completely opening Pandora's box, I will say we definitely have > multiple ways we can solve the metadata management for tracking (e.g. in > meta, in some other system table, in some other system, in a per-store > file). Each of them have pro's and con's, and each of them has "favor" > as to what pain we've most recently felt as a project. > > I don't want to defer having the discussion on what the "correct" one > should be, but I do want to point out that it's only half of the problem > of storefile tracking. > > My hope is that we can make this tracking system be pluggable, such that > we can prototype a solution that works "good enough" for now and enables > the rest of the development work to keep moving forward. > > I'm happy to see so many other folks also interested in the design of > how we store this. > > Could I suggest we move this discussion around the metadata storage into > its own thread? If Duo doesn't already have a design doc started, I can > also try to put one together this week. > > Does that work for you all? > > On 5/22/21 11:02 AM, 张铎(Duo Zhang) wrote: > > I could put up a simple design doc for this. > > > > But there is still a problem, about how to do rolling upgrading. > > > > After we changed the behavior, the region server will write partial store > > files directly into the data directory. For new region servers, this is > not > > a problem, as we will read the hfilelist file to find out the valid store > > files. > > But when rolling upgrading, we can not upgrade all the regionservers at > > once, for old regionservers, they will initialize a store by listing the > > store files, so if a new regionserver crashes when compacting and its > > regions are assigned to old regionservers, the old regionservers will be > in > > trouble... > > > > Stack <[email protected]> 于2021年5月22日周六 下午12:14写道: > > > >> HBASE-24749 design and implementation had acknowledged compromises on > >> review: e.g. adding a new 'system table' to hold store files. I'd > suggest > >> the design and implementation need a revisit before we go forward; for > >> instance, factoring for systems other than s3 as suggested above (I like > >> the Duo list). > >> > >> S > >> > >> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) <[email protected]> > >> wrote: > >> > >>> What about just storing the hfile list in a file? Since now S3 has > strong > >>> consistency, we could safely overwrite a file then I think? > >>> > >>> And since the hfile list file will be very small, renaming will not be > a > >>> big problem. > >>> > >>> We could write the hfile list to a file called 'hfile.list.tmp', and > then > >>> rename it to 'hfile.list'. > >>> > >>> This is safe for HDFS, and for S3, since it is not atomic, maybe we > could > >>> face that, the 'hfile.list' file is not there, but there is a > >>> 'hfile.list.tmp'. > >>> > >>> So when opening a HStore, we first check if 'hfile.list' is there, if > >> not, > >>> try 'hfile.list.tmp', rename it and load it. For safety, we could write > >> an > >>> initial hfile list file with no hfiles. So if we can not load either > >>> 'hfile.list' or 'hfile.list.tmp', then we know something is wrong so > >> users > >>> should try to fix it with HBCK. > >>> And in HBCK, we will do a listing and generate the 'hfile.list' file. > >>> > >>> WDYT? > >>> > >>> Thanks. 
> >>>
> >>> On Wed, May 19, 2021 at 10:43 PM, Wellington Chevreuil
> >>> <[email protected]> wrote:
> >>>
> >>>> Thank you, Andrew and Duo,
> >>>>
> >>>> Talking internally with Josh Elser, the initial idea was to rebase
> >>>> the feature branch onto master (in order to catch up with the latest
> >>>> commits), then focus on work to have a minimally functioning hbase;
> >>>> in other words, together with the already committed work from
> >>>> HBASE-25391, make sure flush, compactions, splits and merges can all
> >>>> take advantage of the persistent store file manager and complete
> >>>> with no need to rely on renames. These all map to the subtasks
> >>>> HBASE-25391, HBASE-25392 and HBASE-25393. Once we can test and
> >>>> validate that this works well for our goals, we can then focus on
> >>>> snapshots, bulkloading and tooling.
> >>>>
> >>>>> S3 now supports strong consistency, and I heard that they are also
> >>>>> implementing atomic renaming currently, so maybe that's one of the
> >>>>> reasons why the development is silent now...
> >>>>>
> >>>> Interesting, I had no idea this was being implemented. I know,
> >>>> however, that a version of this feature is already available on the
> >>>> latest EMR releases (at least from 6.2.0), and the AWS team has
> >>>> published their own blog post with their results:
> >>>>
> >>>> https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
> >>>>
> >>>>> But I do not think storing the hfile list in meta is the only
> >>>>> solution. It will cause cyclic dependencies for hbase:meta, and
> >>>>> then force us to have a fallback solution, which makes the code a
> >>>>> bit ugly. We should try to see if this could be done with only the
> >>>>> FileSystem.
> >>>>>
> >>>> This is indeed a relevant concern. One idea I had mentioned in the
> >>>> original design doc was to track committed/non-committed files
> >>>> through xattr (or tags), which may have its own performance issues
> >>>> as explained by Stephen Wu, but is something that could be
> >>>> attempted.
> >>>>
> >>>> On Wed, May 19, 2021 at 04:56, 张铎(Duo Zhang) <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> S3 now supports strong consistency, and I heard that they are also
> >>>>> implementing atomic renaming currently, so maybe that's one of the
> >>>>> reasons why the development is silent now...
> >>>>>
> >>>>> For me, I also think deploying hbase on cloud storage is the
> >>>>> future, so I would also like to participate here.
> >>>>>
> >>>>> But I do not think storing the hfile list in meta is the only
> >>>>> solution. It will cause cyclic dependencies for hbase:meta, and
> >>>>> then force us to have a fallback solution, which makes the code a
> >>>>> bit ugly. We should try to see if this could be done with only the
> >>>>> FileSystem.
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>> On Wed, May 19, 2021 at 8:04 AM, Andrew Purtell
> >>>>> <[email protected]> wrote:
> >>>>>
> >>>>>> Wellington (et al.),
> >>>>>>
> >>>>>> S3 is also an important piece of our future production plans.
> >>>>>> Unfortunately, we were unable to assist much with last year's
> >>>>>> work, on account of being sidetracked by more immediate concerns.
> >>>>>> Fortunately, this renewed interest is timely in that we have an
> >>>>>> HBase 2 project where, if this can land in a 2.5 or a 2.6, it
> >>>>>> could be an important cost-to-serve optimization, and one we could
> >>>>>> and would make use of. Therefore I would like to restate my
> >>>>>> employer's interest in this work too. It may just be Viraj and
> >>>>>> myself in the early days.
> >>>>>>
> >>>>>> I'm not sure how best to collaborate. We could review changes from
> >>>>>> the original authors, contribute new changes, and/or divide up the
> >>>>>> development tasks. We can certainly offer our time for testing,
> >>>>>> and can afford the costs of testing against the S3 service.
> >>>>>>
> >>>>>> On Tue, May 18, 2021 at 12:16 PM Wellington Chevreuil
> >>>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>>> Greetings everyone,
> >>>>>>>
> >>>>>>> HBASE-24749 was proposed almost a year ago, introducing a new
> >>>>>>> StoreFile tracker as a way to allow any hbase hfile modifications
> >>>>>>> to be safely completed without needing a file system rename. This
> >>>>>>> seems pretty relevant for deployments over S3 file systems, where
> >>>>>>> rename operations are not atomic and can degrade performance when
> >>>>>>> multiple requests are submitted concurrently to the same bucket.
> >>>>>>> We have done superficial tests and ycsb runs, where individual
> >>>>>>> renames of files larger than 5GB can take a few hundred seconds
> >>>>>>> to complete. We also observed impacts on write load throughput,
> >>>>>>> with the renames potentially being the bottleneck.
> >>>>>>>
> >>>>>>> With S3 being an important piece of my employer's cloud solution,
> >>>>>>> we would like to help this work move forward. We plan to
> >>>>>>> contribute new patches per the original design/Jira, but we'd
> >>>>>>> also be happy to review changes from the original authors, too.
> >>>>>>> Please let us know if anyone has any concerns; otherwise we'll
> >>>>>>> start to self-assign issues on HBASE-24749.
> >>>>>>>
> >>>>>>> Wellington
> >>>>>>
> >>>>>> --
> >>>>>> Best regards,
> >>>>>> Andrew
> >>>>>>
> >>>>>> Words like orphans lost among the crosstalk, meaning torn from
> >>>>>> truth's decrepit hands
> >>>>>>    - A23, Crosstalk
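To make Josh's pluggability point above concrete, here is one hypothetical shape such a seam could take. The StoreFileTracker name and its methods are assumptions for illustration, not the API the project eventually committed:

import java.io.IOException;
import java.util.Collection;

/**
 * One possible seam for pluggable store file tracking. Implementations
 * could be backed by hbase:meta, a separate system table, an external
 * system, or a per-store file such as the 'hfile.list' proposal above.
 */
public interface StoreFileTracker {

  /** Returns the currently committed store files for this store. */
  Collection<String> load() throws IOException;

  /** Records newly committed files, e.g. after a flush. */
  void add(Collection<String> newFiles) throws IOException;

  /**
   * Atomically replaces compaction inputs with compaction outputs, so a
   * crash mid-compaction never exposes a partial file set.
   */
  void replace(Collection<String> compactedFiles,
      Collection<String> newFiles) throws IOException;
}

Flush, compaction, split, and merge would each commit through this seam, which is what would let a "good enough" file-based prototype be swapped later for a meta- or table-backed tracker without touching the rest of the write path.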
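Wellington's idea of distinguishing committed from non-committed files via xattr (or tags) could look roughly like the following, written against the Hadoop FileSystem xattr API. Whether a given S3 connector supports setXAttr is an open question, and the attribute name is a made-up placeholder, so treat this purely as an illustration of the idea:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative commit marker via xattrs; all names are assumptions. */
public final class CommitMarker {

  // HDFS xattr names must carry a namespace prefix such as "user.".
  private static final String COMMITTED = "user.hbase.committed";

  private CommitMarker() {
  }

  /** Flags an hfile as committed once it is fully written. */
  static void markCommitted(FileSystem fs, Path hfile) throws IOException {
    fs.setXAttr(hfile, COMMITTED, new byte[] { 1 });
  }

  /** A file without the marker is a leftover from a failed operation. */
  static boolean isCommitted(FileSystem fs, Path hfile) throws IOException {
    return fs.getXAttrs(hfile).containsKey(COMMITTED);
  }
}

As noted in the thread, the extra per-file metadata round trips are the likely performance concern with this approach.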
