Oh, sorry, I missed that. I think the key point here is that there must be no partial storefiles in the data directory when we downgrade. Would it be possible to do this by first setting the flag to false, so that no new partial storefiles are written, and then using an HBCK command to remove all the existing partial storefiles? And in general, I think we should have a way to clean up broken storefiles automatically when the new layout is in use, so maybe we could also rely on that to remove them.
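
A rough sketch of the cleanup I have in mind, in Python rather than Java just to keep it short. It reuses the 'hfile.list' / 'hfile.list.tmp' names from earlier in this thread; the one-name-per-line file format and the two function names are only assumptions for illustration, not an existing HBase or HBCK API.

```python
import os

HFILE_LIST = "hfile.list"
HFILE_LIST_TMP = "hfile.list.tmp"


def load_hfile_list(store_dir):
    """Return the set of storefile names tracked for this store.

    Assumed format (hypothetical): one storefile name per line.
    """
    final_path = os.path.join(store_dir, HFILE_LIST)
    tmp_path = os.path.join(store_dir, HFILE_LIST_TMP)
    if not os.path.exists(final_path):
        if not os.path.exists(tmp_path):
            # Neither file is there: do not guess, leave the repair
            # (listing and regenerating the list) to an operator tool.
            raise FileNotFoundError("no hfile list in " + store_dir)
        # The writer died between writing the tmp file and renaming it.
        # This assumes the tmp file only becomes visible once complete.
        os.replace(tmp_path, final_path)
    with open(final_path) as f:
        return {line.strip() for line in f if line.strip()}


def remove_untracked_storefiles(store_dir):
    """Delete files that are not in the hfile list; return their names."""
    tracked = load_hfile_list(store_dir)
    removed = []
    for name in sorted(os.listdir(store_dir)):
        if name in (HFILE_LIST, HFILE_LIST_TMP) or name in tracked:
            continue
        path = os.path.join(store_dir, name)
        if os.path.isfile(path):
            os.remove(path)
            removed.append(name)
    return removed
```

Something like this should only run after the flag has been set to false everywhere, otherwise a file that a running flush or compaction has not yet added to the list would look untracked and be deleted.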
Thanks.

On Tue, May 25, 2021 at 12:24 AM Andrew Purtell <[email protected]> wrote:

> > And for downgrading, usually we do not support downgrading from a major
> > version upgrading, so it is not a big problem.
>
> You missed an earlier comment from me.
>
> Our team requires this to be released in a branch-2 version or we can't use
> it. Therefore I am not in favor of any solution that requires a major
> version increment.
>
> On Sun, May 23, 2021 at 5:43 AM 张铎(Duo Zhang) <[email protected]> wrote:
>
> > I do not think it should be a table level config. It should be a cluster
> > level config. We only have one FileSystem so it is useless to let
> > different tables have different ways to store hfile list.
> >
> > But I think the general approach is fine. We could introduce a config for
> > whether to enable 'write to data directory directly' mode. When rolling
> > upgrading, the flag should be false, you can change it to true after the
> > whole cluster has been upgraded.
> >
> > And for downgrading, usually we do not support downgrading from a major
> > version upgrading, so it is not a big problem.
> >
> > Thanks.
> >
> > On Sun, May 23, 2021 at 12:53 AM Andrew Purtell <[email protected]>
> > wrote:
> >
> > > Put a check in the code whether hfilelist mode or original store layout
> > > is in use and handles both cases. Then, to upgrade:
> > >
> > > 1. First, perform a rolling upgrade to $NEW_VERSION .
> > >
> > > 2. Once upgraded to $NEW_VERSION execute an alter table command that
> > > enables hfilelist mode. This will cause all regions to close and reopen
> > > in the new mode.
> > >
> > > Because the rolling upgrade to $NEW_VERSION is completed first a mix of
> > > old and new layouts is fine, for the brief period of time when store
> > > layouts are upgrading in response to the alter command, because this
> > > version can handle both.
> > >
> > > Downgrade to an older version is not possible after the alter table
> > > command, so this must be clearly documented, but of course would not
> > > be a surprise to anyone, because the alter command is for switching to
> > > the new store layout.
> > >
> > > > On May 22, 2021, at 8:03 AM, 张铎 <[email protected]> wrote:
> > > >
> > > > I could put up a simple design doc for this.
> > > >
> > > > But there is still a problem, about how to do rolling upgrading.
> > > >
> > > > After we changed the behavior, the region server will write partial
> > > > store files directly into the data directory. For new region servers,
> > > > this is not a problem, as we will read the hfilelist file to find out
> > > > the valid store files.
> > > > But when rolling upgrading, we can not upgrade all the regionservers
> > > > at once, for old regionservers, they will initialize a store by
> > > > listing the store files, so if a new regionserver crashes when
> > > > compacting and its regions are assigned to old regionservers, the old
> > > > regionservers will be in trouble...
> > > >
> > > > On Sat, May 22, 2021 at 12:14 PM Stack <[email protected]> wrote:
> > > >
> > > >> HBASE-24749 design and implementation had acknowledged compromises
> > > >> on review: e.g. adding a new 'system table' to hold store files. I'd
> > > >> suggest the design and implementation need a revisit before we go
> > > >> forward; for instance, factoring for systems other than s3 as
> > > >> suggested above (I like the Duo list).
> > > >>
> > > >> S
> > > >>
> > > >>> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) <[email protected]>
> > > >>> wrote:
> > > >>>
> > > >>> What about just storing the hfile list in a file? Since now S3 has
> > > >>> strong consistency, we could safely overwrite a file then I think?
> > > >>>
> > > >>> And since the hfile list file will be very small, renaming will not
> > > >>> be a big problem.
> > > >>>
> > > >>> We could write the hfile list to a file called 'hfile.list.tmp',
> > > >>> and then rename it to 'hfile.list'.
> > > >>>
> > > >>> This is safe for HDFS, and for S3, since it is not atomic, maybe we
> > > >>> could face that, the 'hfile.list' file is not there, but there is a
> > > >>> 'hfile.list.tmp'.
> > > >>>
> > > >>> So when opening a HStore, we first check if 'hfile.list' is there,
> > > >>> if not, try 'hfile.list.tmp', rename it and load it. For safety, we
> > > >>> could write an initial hfile list file with no hfiles. So if we can
> > > >>> not load either 'hfile.list' or 'hfile.list.tmp', then we know
> > > >>> something is wrong so users should try to fix it with HBCK.
> > > >>> And in HBCK, we will do a listing and generate the 'hfile.list'
> > > >>> file.
> > > >>>
> > > >>> WDYT?
> > > >>>
> > > >>> Thanks.
> > > >>>
> > > >>> On Wed, May 19, 2021 at 10:43 PM Wellington Chevreuil
> > > >>> <[email protected]> wrote:
> > > >>>
> > > >>>> Thank you, Andrew and Duo,
> > > >>>>
> > > >>>> Talking internally with Josh Elser, initial idea was to rebase the
> > > >>>> feature branch with master (in order to catch with latest
> > > >>>> commits), then focus on work to have a minimal functioning hbase,
> > > >>>> in other words, together with the already committed work from
> > > >>>> HBASE-25391, make sure flush, compactions, splits and merges all
> > > >>>> can take advantage of the persistent store file manager and
> > > >>>> complete with no need to rely on renames. These all map to the
> > > >>>> subtasks HBASE-25391, HBASE-25392 and HBASE-25393. Once we could
> > > >>>> test and validate this works well for our goals, we can then focus
> > > >>>> on snapshots, bulkloading and tooling.
> > > >>>>
> > > >>>>> S3 now supports strong consistency, and I heard that they are
> > > >>>>> also implementing atomic renaming currently, so maybe that's one
> > > >>>>> of the reasons why the development is silent now..
> > > >>>>
> > > >>>> Interesting, I had no idea this was being implemented. I know,
> > > >>>> however, a version of this feature is already available on latest
> > > >>>> EMR releases (at least from 6.2.0), and AWS team has published
> > > >>>> their own blog post with their results:
> > > >>>>
> > > >>>> https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
> > > >>>>
> > > >>>>> But I do not think store hfile list in meta is the only solution.
> > > >>>>> It will cause cyclic dependencies for hbase:meta, and then force
> > > >>>>> us a have a fallback solution which makes the code a bit ugly. We
> > > >>>>> should try to see if this could be done with only the FileSystem.
> > > >>>>
> > > >>>> This is indeed a relevant concern. One idea I had mentioned in the
> > > >>>> original design doc was to track committed/non-committed files
> > > >>>> through xattr (or tags), which may have its own performance issues
> > > >>>> as explained by Stephen Wu, but is something that could be
> > > >>>> attempted.
> > > >>>>
> > > >>>> On Wed, May 19, 2021 at 04:56, 张铎(Duo Zhang) <[email protected]>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> S3 now supports strong consistency, and I heard that they are
> > > >>>>> also implementing atomic renaming currently, so maybe that's one
> > > >>>>> of the reasons why the development is silent now...
> > > >>>>>
> > > >>>>> For me, I also think deploying hbase on cloud storage is the
> > > >>>>> future, so I would also like to participate here.
> > > >>>>>
> > > >>>>> But I do not think store hfile list in meta is the only solution.
> > > >>>>> It will cause cyclic dependencies for hbase:meta, and then force
> > > >>>>> us a have a fallback solution which makes the code a bit ugly. We
> > > >>>>> should try to see if this could be done with only the FileSystem.
> > > >>>>>
> > > >>>>> Thanks.
> > > >>>>>
> > > >>>>> On Wed, May 19, 2021 at 8:04 AM Andrew Purtell
> > > >>>>> <[email protected]> wrote:
> > > >>>>>
> > > >>>>>> Wellington (and et. al),
> > > >>>>>>
> > > >>>>>> S3 is also an important piece of our future production plans.
> > > >>>>>> Unfortunately, we were unable to assist much with last year's
> > > >>>>>> work, on account of being sidetracked by more immediate
> > > >>>>>> concerns. Fortunately, this renewed interest is timely in that
> > > >>>>>> we have an HBase 2 project where, if this can land in a 2.5 or a
> > > >>>>>> 2.6, it could be an important cost to serve optimization, and
> > > >>>>>> one we could and would make use of. Therefore I would like to
> > > >>>>>> restate my employer's interest in this work too. It may just be
> > > >>>>>> Viraj and myself in the early days.
> > > >>>>>>
> > > >>>>>> I'm not sure how best to collaborate. We could review changes
> > > >>>>>> from the original authors, new changes, and/or divide up the
> > > >>>>>> development tasks. We can certainly offer our time for testing,
> > > >>>>>> and can afford the costs of testing against the S3 service.
> > > >>>>>>
> > > >>>>>> On Tue, May 18, 2021 at 12:16 PM Wellington Chevreuil
> > > >>>>>> <[email protected]> wrote:
> > > >>>>>>
> > > >>>>>>> Greetings everyone,
> > > >>>>>>>
> > > >>>>>>> HBASE-24749 has been proposed almost a year ago, introducing a
> > > >>>>>>> new StoreFile tracker as a way to allow for any hbase hfile
> > > >>>>>>> modifications to be safely completed without needing a file
> > > >>>>>>> system rename. This seems pretty relevant for deployments over
> > > >>>>>>> S3 file systems, where rename operations are not atomic and can
> > > >>>>>>> have a performance degradation when multiple requests get
> > > >>>>>>> concurrently submitted to the same bucket. We had done
> > > >>>>>>> superficial tests and ycsb runs, where individual renames of
> > > >>>>>>> files larger than 5GB can take a few hundreds of seconds to
> > > >>>>>>> complete. We also observed impacts in write loads throughput,
> > > >>>>>>> the bottleneck potentially being the renames.
> > > >>>>>>>
> > > >>>>>>> With S3 being an important piece of my employer cloud solution,
> > > >>>>>>> we would like to help it move forward. We plan to contribute
> > > >>>>>>> new patches per the original design/Jira, but we’d also be
> > > >>>>>>> happy to review changes from the original authors, too.
> > > >>>>>>> Please let us know if anyone has any concerns, otherwise we’ll
> > > >>>>>>> start to self-assign issues on HBASE-24749
> > > >>>>>>>
> > > >>>>>>> Wellington
> > > >>>>>>
> > > >>>>>> --
> > > >>>>>> Best regards,
> > > >>>>>> Andrew
> > > >>>>>>
> > > >>>>>> Words like orphans lost among the crosstalk, meaning torn from
> > > >>>>>> truth's decrepit hands
> > > >>>>>>    - A23, Crosstalk
>
> --
> Best regards,
> Andrew
>
> Words like orphans lost among the crosstalk, meaning torn from truth's
> decrepit hands
>    - A23, Crosstalk
