Em qui., 19 de ago. de 2021 às 08:50, Jelte Fennema < jelte.fenn...@microsoft.com> escreveu:
> ## What is this? > > It's a draft patch that replaces code like this: > ```c > pg_file_unlink(PG_FUNCTION_ARGS) > { > char *filename; > requireSuperuser(); > filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0)); > ``` > > With this shorter version: > ```c > pg_file_unlink(PG_FUNCTION_ARGS) > { > requireSuperuser(); > char *filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0)); > ``` > > ## Why would we want this? > 1. It removes 23328 lines of code that don't have any impact on how the > code behaves [1]. This is roughly 2.7% of all the lines of code in the > codebase. > 2. Declarations are closer to the actual usage. This is advised by the > "Code Complete" book [2] and has the following advantages: > a. This limits variable scope to what is necessary. Which in turn makes > the mental model you have to keep of a function when reading the code > simpler. > b. In most cases it allows you to see the the type of a variable > without needing to go to the top of the function. > 3. You can do input checking and assertions at the top of the function, > instead of having to put declarations in front of it. This makes it clear > which invariants hold for the function. (as seen in the example above and > the changes for pg_file_rename[3]) > 4. Declaring variables after statements is allowed in C99. Postgres > already requires C99, so it might as well use this feature too. > > ## How was this changeset created? > > I created a script that modifies all C files in the following way: > > 1. Find a non `static` declaration. > 2. If it has an initializer and it is not a single variable/constant, > don't consider replacing it. (reason: a function call initializer might > have sideffects). > 3. Otherwise (it's a simple initializer or it has no initializer at all), > take the type and variable from that declaration. > 4. Find the next use of the variable. > 5. If the next use is not an assignment, don't do anything (the value of > the original initialization is used). > 6. If the next use is an assignment: > 1. Remove the old declaration > 2. Prefix the found assignment with the type > 3. Unless the variable is also used in the same line of the new > initialization, e.g: > ```c > int a = 1; > a = a + a; > ``` > > ## How does this script work? > > It uses a Perl regex to search and replace! (obligatory jokes at the > bottom of the email) This regex is based on the ones used in this PR to > citus[4] and the similar PR to pg_auto_failover[5]. The script is in the > 3rd commit of this patchset. > > To visualize the regex in the script in a reasonable way, copy paste the > search part of it: > ``` > \n\t(?!(return|static)\b)(?P<type>(\w+[\t ])+[\t *]*)(?>(?P<variable>\w+)( > = > [\w>\s\n-]*?)?;\n(?P<code_between>(?>(?P<comment_or_string_or_not_preprocessor>\/\*.*?\*\/|"(?>\\"|.)*?"|(?!goto)[^#]))*?)(\t)?(?=\b(?P=variable)\b))(?<=\n\t)(?<!:\n\t)(?P=variable) > =(?![^;]*?[^>_]\b(?P=variable)\b[^_]) > ``` > > And paste that into https://www.debuggex.com/, then select PCRE from the > dropdown. (Sharing seems to be down at this point, so this is the only way > to do it at the moment) Try it out! The regex is not as crazy as it looks. > > There's two important assumptions this regex makes: > 1. Code is indented using tabs, and the intent level determines the scope. > (this is true, because of `pgindent`) > 2. Declared variables are actually used. (this is true, because we would > get compiler warnings otherwise) > > There's two cases where this regex has some special behaviour: > 1. Stop searching for the first usage of a variable when either a `goto` > or a preprocessor command is found (outside a string or comment). These > things affect the control flow in a way that the regex does not understand. > (any `#` character is assumed to be a preprocessor commands). > 2. Don't replace if the assignment is right after a `label:`, by checking > if there was a `:` as the last character on the previous line. This works > around errors like this: > ``` > hashovfl.c:865:3: error: a label can only be part of a statement and a > declaration is not a statement > OffsetNumber maxroffnum = PageGetMaxOffsetNumber(rpage); > ^~~~~~~~~~~~ > ``` > Detecting this case in this way is not perfect, because sometimes there is > an extra newline or a comment between the label and the assignment. This is > not easily detectable by the regex, because lookbehinds cannot have a > variable length in Perl (or most regex engines for that matter). For these > few cases (5 in the entire codebase) a manual change was done either before > or after the automatic replacement to fix them so the code compiles again. > (2nd and 5th commits of this patchset) > > After all these changes `make -s world` doesn't show any warnings or > errors and `make check-world` succeeds. I configured compilation like this: > ``` > ./configure --enable-cassert --enable-depend --enable-debug --with-openssl > --with-libxml --with-libxslt --with-uuid=e2fs --with-icu > ``` > > ## What I want to achieve with this email > > 1. For people to look at a small sample of the changes made by this > script. I will try to send patches with the actual changes made to the code > as attachments in a followup email. However, I don't know if that will > succeed since the patch files are very big and it might run into some > mailserver limits. So in case that doesn't work, the full changeset is > available on Github[6]. This make viewing this enormous diff quite a bit > easier IMHO anyaway. If you see something weird that is not covered in the > "Known issues" section below, please share it. That way it can be discussed > and/or fixed. > 2. A discussion on if this type of code change would be a net positive for > Postgres codebase. Please explain clearly why or why not. > 3. Some links to resources on what's necessary to get a big refactoring > patch like this merged. > > ## What I don't want to achieve with this email > > 1. For someone to go over all the changes right now. There's likely to be > changes to the script or something else. Doing a full review of the changes > would be better saved for later during a final review. > > ## Known issues with the currently generated code > > There's a few issues with the final generated code that I've already > spotted. These should all be relatively easy to fix in an automated way. > However, I think doing so is probably better done by `pgindent` or some > other auto formatting tool, instead of with the regex. Note that I did run > `pgindent`, it just didn't address these things: > > 1. Whitespace between type and variable is kept the same as it was before > moving the declaration. If the original declaration had whitespace added in > such a way that all variable names of declarations aligned, then this > whitespace is preserved. This looks weird in various cases. See [3] for an > example. > 2. `pgindent` adds a newline after each block of declarations, even if > they are not at the start of function. If this is desired is debatable, but > to me it seems like it adds excessive newlines. See [7] for an example. > 3. If all declarations are moved away from the top of the function, then > an empty newline is kept at the top of the function. This seems like an > unnecessary newline to me. See [3] for an example. > > Sources: > 1. `tokei`[8] results of lines of code: > Before: > ``` > $ tokei --type C > > =============================================================================== > Language Files Lines Code Comments > Blanks > > =============================================================================== > C 1389 1361309 866514 330933 > 163862 > > =============================================================================== > Total 1389 1361309 866514 330933 > 163862 > > =============================================================================== > ``` > After: > ``` > $ tokei --type C > > =============================================================================== > Language Files Lines Code Comments > Blanks > > =============================================================================== > C 1389 1347515 843186 330952 > 173377 > > =============================================================================== > Total 1389 1347515 843186 330952 > 173377 > > =============================================================================== > ``` > 2. > https://github.com/mgp/book-notes/blob/master/code-complete.markdown#103-guidelines-for-initializing-variables > 3. > ```patch > @@ -401,11 +389,10 @@ pg_file_rename_internal(text *file1, text *file2, > text *file3) > Datum > pg_file_unlink(PG_FUNCTION_ARGS) > { > - char *filename; > > requireSuperuser(); > > - filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0)); > + char *filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0)); > > if (access(filename, W_OK) < 0) > ``` > 4. https://github.com/citusdata/citus/pull/3181 > 5. https://github.com/citusdata/pg_auto_failover/pull/266 > 6. https://github.com/JelteF/postgres/pull/2 > 7. > ```patch > @@ -2863,7 +2886,8 @@ palloc_btree_page(BtreeCheckState *state, > BlockNumber blocknum) > * longer than we must. > */ > Buffer buffer = ReadBufferExtended(state->rel, MAIN_FORKNUM, blocknum, > RBM_NORMAL, > - state->checkstrategy); > + state->checkstrategy); > + > LockBuffer(buffer, BT_READ); > > /* > ``` > 8. https://github.com/XAMPPRocky/tokei > > > Obligatory jokes: > 1. https://xkcd.com/1171/ > 2. https://xkcd.com/208/ > 3. https://stackoverflow.com/a/1732454/2570866 (and serious response to > it > https://neilmadden.blog/2019/02/24/why-you-really-can-parse-html-and-anything-else-with-regular-expressions/ > ) > -1 in short. C needs readability, not fewer lines. Aside from horrible code, it doesn't improve 0.1% on anything. I think it's a bad idea and I'm strongly against it. regards, Ranier Vilela