> And for downgrading, usually we do not support downgrading from a major version upgrading, so it is not a big problem.
You missed an earlier comment from me. Our team requires this to be
released in a branch-2 version or we can't use it. Therefore I am not in
favor of any solution that requires a major version increment.

On Sun, May 23, 2021 at 5:43 AM 张铎(Duo Zhang) <[email protected]> wrote:

> I do not think it should be a table level config. It should be a cluster
> level config. We only have one FileSystem so it is useless to let
> different tables have different ways to store hfile list.
>
> But I think the general approach is fine. We could introduce a config for
> whether to enable 'write to data directory directly' mode. When rolling
> upgrading, the flag should be false, you can change it to true after the
> whole cluster has been upgraded.
>
> And for downgrading, usually we do not support downgrading from a major
> version upgrading, so it is not a big problem.
>
> Thanks.
>
> Andrew Purtell <[email protected]> wrote on Sun, May 23, 2021 at 12:53 AM:
>
>> Put a check in the code whether hfilelist mode or original store layout
>> is in use and handles both cases. Then, to upgrade:
>>
>> 1. First, perform a rolling upgrade to $NEW_VERSION .
>>
>> 2. Once upgraded to $NEW_VERSION execute an alter table command that
>> enables hfilelist mode. This will cause all regions to close and reopen
>> in the new mode.
>>
>> Because the rolling upgrade to $NEW_VERSION is completed first a mix of
>> old and new layouts is fine, for the brief period of time when store
>> layouts are upgrading in response to the alter command, because this
>> version can handle both.
>>
>> Downgrade to an older version is not possible after the alter table
>> command, so this must be clearly documented, but of course would not be
>> a surprise to anyone, because the alter command is for switching to the
>> new store layout.
>>
>>> On May 22, 2021, at 8:03 AM, 张铎 <[email protected]> wrote:
>>>
>>> I could put up a simple design doc for this.
>>>
>>> But there is still a problem, about how to do rolling upgrading.
>>>
>>> After we changed the behavior, the region server will write partial
>>> store files directly into the data directory. For new region servers,
>>> this is not a problem, as we will read the hfilelist file to find out
>>> the valid store files.
>>> But when rolling upgrading, we can not upgrade all the regionservers at
>>> once, for old regionservers, they will initialize a store by listing
>>> the store files, so if a new regionserver crashes when compacting and
>>> its regions are assigned to old regionservers, the old regionservers
>>> will be in trouble...
>>>
>>> Stack <[email protected]> wrote on Sat, May 22, 2021 at 12:14 PM:
>>>
>>>> HBASE-24749 design and implementation had acknowledged compromises on
>>>> review: e.g. adding a new 'system table' to hold store files. I'd
>>>> suggest the design and implementation need a revisit before we go
>>>> forward; for instance, factoring for systems other than s3 as
>>>> suggested above (I like the Duo list).
>>>>
>>>> S
>>>>
>>>>> On Wed, May 19, 2021 at 8:19 AM 张铎(Duo Zhang) <[email protected]>
>>>>> wrote:
>>>>>
>>>>> What about just storing the hfile list in a file? Since now S3 has
>>>>> strong consistency, we could safely overwrite a file then I think?
>>>>>
>>>>> And since the hfile list file will be very small, renaming will not
>>>>> be a big problem.
>>>>>
>>>>> We could write the hfile list to a file called 'hfile.list.tmp', and
>>>>> then rename it to 'hfile.list'.
>>>>>
>>>>> This is safe for HDFS, and for S3, since it is not atomic, maybe we
>>>>> could face that, the 'hfile.list' file is not there, but there is a
>>>>> 'hfile.list.tmp'.
>>>>>
>>>>> So when opening a HStore, we first check if 'hfile.list' is there, if
>>>>> not, try 'hfile.list.tmp', rename it and load it.
>>>>> For safety, we could write an initial hfile list file with no hfiles.
>>>>> So if we can not load either 'hfile.list' or 'hfile.list.tmp', then
>>>>> we know something is wrong so users should try to fix it with HBCK.
>>>>> And in HBCK, we will do a listing and generate the 'hfile.list' file.
>>>>>
>>>>> WDYT?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Wellington Chevreuil <[email protected]> wrote on Wed, May 19, 2021
>>>>> at 10:43 PM:
>>>>>
>>>>>> Thank you, Andrew and Duo,
>>>>>>
>>>>>> Talking internally with Josh Elser, initial idea was to rebase the
>>>>>> feature branch with master (in order to catch up with latest
>>>>>> commits), then focus on work to have a minimal functioning hbase, in
>>>>>> other words, together with the already committed work from
>>>>>> HBASE-25391, make sure flush, compactions, splits and merges all can
>>>>>> take advantage of the persistent store file manager and complete
>>>>>> with no need to rely on renames. These all map to the subtasks
>>>>>> HBASE-25391, HBASE-25392 and HBASE-25393. Once we could test and
>>>>>> validate this works well for our goals, we can then focus on
>>>>>> snapshots, bulkloading and tooling.
>>>>>>
>>>>>>> S3 now supports strong consistency, and I heard that they are also
>>>>>>> implementing atomic renaming currently, so maybe that's one of the
>>>>>>> reasons why the development is silent now..
>>>>>>
>>>>>> Interesting, I had no idea this was being implemented.
>>>>>> I know, however, a version of this feature is already available on
>>>>>> latest EMR releases (at least from 6.2.0), and AWS team has
>>>>>> published their own blog post with their results:
>>>>>>
>>>>>> https://aws.amazon.com/blogs/big-data/amazon-emr-6-2-0-adds-persistent-hfile-tracking-to-improve-performance-with-hbase-on-amazon-s3/
>>>>>>
>>>>>>> But I do not think store hfile list in meta is the only solution.
>>>>>>> It will cause cyclic dependencies for hbase:meta, and then force us
>>>>>>> to have a fallback solution which makes the code a bit ugly. We
>>>>>>> should try to see if this could be done with only the FileSystem.
>>>>>>
>>>>>> This is indeed a relevant concern. One idea I had mentioned in the
>>>>>> original design doc was to track committed/non-committed files
>>>>>> through xattr (or tags), which may have its own performance issues
>>>>>> as explained by Stephen Wu, but is something that could be
>>>>>> attempted.
>>>>>>
>>>>>> On Wed, May 19, 2021 at 04:56, 张铎(Duo Zhang) <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> S3 now supports strong consistency, and I heard that they are also
>>>>>>> implementing atomic renaming currently, so maybe that's one of the
>>>>>>> reasons why the development is silent now...
>>>>>>>
>>>>>>> For me, I also think deploying hbase on cloud storage is the
>>>>>>> future, so I would also like to participate here.
>>>>>>>
>>>>>>> But I do not think store hfile list in meta is the only solution.
>>>>>>> It will cause cyclic dependencies for hbase:meta, and then force us
>>>>>>> to have a fallback solution which makes the code a bit ugly. We
>>>>>>> should try to see if this could be done with only the FileSystem.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> Andrew Purtell <[email protected]> wrote on Wed, May 19, 2021 at
>>>>>>> 8:04 AM:
>>>>>>>
>>>>>>>> Wellington (et al.),
>>>>>>>>
>>>>>>>> S3 is also an important piece of our future production plans.
>>>>>>>> Unfortunately, we were unable to assist much with last year's
>>>>>>>> work, on account of being sidetracked by more immediate concerns.
>>>>>>>> Fortunately, this renewed interest is timely in that we have an
>>>>>>>> HBase 2 project where, if this can land in a 2.5 or a 2.6, it
>>>>>>>> could be an important cost to serve optimization, and one we could
>>>>>>>> and would make use of. Therefore I would like to restate my
>>>>>>>> employer's interest in this work too. It may just be Viraj and
>>>>>>>> myself in the early days.
>>>>>>>>
>>>>>>>> I'm not sure how best to collaborate. We could review changes from
>>>>>>>> the original authors, new changes, and/or divide up the
>>>>>>>> development tasks. We can certainly offer our time for testing,
>>>>>>>> and can afford the costs of testing against the S3 service.
>>>>>>>>
>>>>>>>> On Tue, May 18, 2021 at 12:16 PM Wellington Chevreuil
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Greetings everyone,
>>>>>>>>>
>>>>>>>>> HBASE-24749 has been proposed almost a year ago, introducing a
>>>>>>>>> new StoreFile tracker as a way to allow for any hbase hfile
>>>>>>>>> modifications to be safely completed without needing a file
>>>>>>>>> system rename. This seems pretty relevant for deployments over S3
>>>>>>>>> file systems, where rename operations are not atomic and can have
>>>>>>>>> a performance degradation when multiple requests get concurrently
>>>>>>>>> submitted to the same bucket.
>>>>>>>>> We had done superficial tests and ycsb runs, where individual
>>>>>>>>> renames of files larger than 5GB can take a few hundred seconds
>>>>>>>>> to complete. We also observed impacts in write loads throughput,
>>>>>>>>> the bottleneck potentially being the renames.
>>>>>>>>>
>>>>>>>>> With S3 being an important piece of my employer's cloud solution,
>>>>>>>>> we would like to help it move forward. We plan to contribute new
>>>>>>>>> patches per the original design/Jira, but we'd also be happy to
>>>>>>>>> review changes from the original authors, too. Please let us know
>>>>>>>>> if anyone has any concerns, otherwise we'll start to self-assign
>>>>>>>>> issues on HBASE-24749
>>>>>>>>>
>>>>>>>>> Wellington
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>> Words like orphans lost among the crosstalk, meaning torn from
>>>>>>>> truth's decrepit hands
>>>>>>>>    - A23, Crosstalk

--
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk
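
For reference, the 'hfile.list.tmp' then rename idea quoted in this thread
can be sketched roughly as below. This is only an illustration in Python
against a local filesystem, not HBase code: the function names
write_hfile_list and load_hfile_list are hypothetical, os.replace stands in
for the FileSystem rename, and raising an error stands in for "ask the user
to run HBCK". It also does not try to detect a partially written temp file,
which the thread does not discuss.

```python
import os
import tempfile

LIST_NAME = "hfile.list"      # file name taken from the thread
TMP_NAME = "hfile.list.tmp"   # file name taken from the thread


def write_hfile_list(store_dir, hfiles):
    """Write the list under the temp name, then rename it to the final name."""
    tmp_path = os.path.join(store_dir, TMP_NAME)
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write("\n".join(hfiles))
        f.flush()
        os.fsync(f.fileno())
    # Stand-in for the FileSystem rename; on S3 this step may not be atomic.
    os.replace(tmp_path, os.path.join(store_dir, LIST_NAME))


def load_hfile_list(store_dir):
    """Load 'hfile.list'; if it is missing, fall back to 'hfile.list.tmp'."""
    final_path = os.path.join(store_dir, LIST_NAME)
    tmp_path = os.path.join(store_dir, TMP_NAME)
    if not os.path.exists(final_path):
        if not os.path.exists(tmp_path):
            # Neither file exists: per the thread, something is wrong and a
            # repair tool (HBCK) would have to regenerate the list.
            raise FileNotFoundError("no hfile list found in " + store_dir)
        # A rename was interrupted earlier: finish it, then load as usual.
        os.replace(tmp_path, final_path)
    with open(final_path, encoding="utf-8") as f:
        return [line for line in f.read().splitlines() if line]


if __name__ == "__main__":
    store = tempfile.mkdtemp()
    write_hfile_list(store, [])              # initial list with no hfiles
    write_hfile_list(store, ["a.hfile", "b.hfile"])
    print(load_hfile_list(store))
```

Writing an empty list when the store is created is what lets the loader
treat "neither file present" as an error rather than as an empty store.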
