On Wed, May 20, 2026 at 1:52 PM Ivan Zhakov <[email protected]> wrote:
> On Tue, 19 May 2026 at 16:27, Timofei Zhakov <[email protected]> wrote: > >> On Mon, May 18, 2026 at 10:58 PM Timofei Zhakov <[email protected]> >> wrote: >> > >> > Long time has passed since it was coded. So to refresh the context, >> > whether you were following or not, the utf8-cmdline branch makes the >> > command-line tools to first convert entire argv into UTF-8 and then >> > assume normalised strings when handling parameters. A work that >> > minimises the amount of potentially forgotten-to-convert strings and >> > brings support for all Unicode symbols in the cmdline on Windows which >> > is impossible right now. >> > >> > There is a test called basic_tests.py:argv_with_best_fit_chars. It >> > checks that svn rejects Unicode symbols. Functionality which was >> > illegal before changes introduced in that branch. >> > >> > I believe the correct solution would be to adjust test cases to expect >> > successful execution instead of an error. I've tried to implement that >> > before by changing arguments in run_and_verify_svn2, but it made me >> > stuck... I kept getting an error message coming from the logger (?) >> > that it struggles to convert output to print it. >> > >> > I really lack expertise with Python and svn-test to figure out the >> > correct way to do it. Could anyone who knows it better than I do step >> > ahead and maybe give it a fresh look from another perspective? >> > >> > It's basically the last thing that keeps us from getting this branch >> > merged so far. I think it's time we finalise it. >> > >> > Alternatively I think we might remove that test as "outdated" because >> > it checks an edge case which is now completely normal after the >> > utf8ization. I would personally prefer the other option, but it's >> > worth mentioning this possibility as well. >> > >> > Thanks! >> > On Sat, May 16, 2026 at 3:06 PM Timofei Zhakov <[email protected]> >> wrote: >> > > >> > > On Fri, May 15, 2026 at 4:02 PM Jun Omae <[email protected]> wrote: >> > > > >> > > > On 2026/05/15 0:28, Timofei Zhakov wrote: >> > > > > There is a test called basic_tests.py:argv_with_best_fit_chars. It >> > > > > checks that svn rejects Unicode symbols. Functionality which was >> > > > > illegal before changes introduced in that branch. >> > > > >> > > > In the branch, svn command receives the arguments as utf-8 bytes, >> but the >> > > > output of the pipe is applied best-fit encoding conversion. >> > > > >> > > > [[[ >> > > > (diff is truncated) >> > > > ]]] >> > > >> > > I tested this patch and can confirm that it works. I don't know why >> > > but as far as I remember I was doing exactly the same thing, but it >> > > didn't work for me. >> > > >> > > I remember I once heard that "everything looks like physics if you >> > > don't know magic". That's exactly the case. Sometimes we just need a >> > > pair of fresh eyes. :-) >> > > >> > > +1 for the changes >> > > >> > > > Recently, I'm trying 1.14.x with utf-8 code page using >> activeCodePage >> > > > manifest [1]. It almost works fine (e.g. add emoji filenames and >> checkout, >> > > > ...) however output to stderr is garbled and not fixed yet. >> > > > >> > > > [1] >> https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page >> > > >> > > That sounds interesting. If as you are saying output is converted to >> > > the local encoding, it introduces a lot of inconsistency and yeah we >> > > have no emojis. >> > > >> > > Since it's almost always that the encoding is UTF-8 on the majority of >> > > Unix systems, I think it makes a lot of sense to take the same >> > > approach on Windows. >> > >> > I saw you committed the patch in r1934334. Thank you so much! >> > >> > I guess the CI is back to green! I'll post an email on dev@ about >> > merging the branch into trunk. >> >> Hey guys! >> >> I believe that it's time we merge the utf8-cmdline branch. I've >> already written the merge log message and pasted it below. Hope there >> is nothing missing because it ended up being pretty massive. >> >> +1. > > I agree that the change is massive. However, on trunk we're just at the > beginning of the (1.16.x) release cycle, so it's probably the most > appropriate moment of time to make a change like this. > > I have tested the branch on Windows: works great for me. > > I also reviewed the code and didn't find any problems. Please find some > minor comments inline. > > [[[ >> Switch command-line programs to UTF-8 mode. >> >> Merging from branch utf8-cmdline-prototype. >> >> The Subversion library internally works with UTF-8 encoded strings which >> simplifies it a lot. However, commands-line programs receive arguments in >> local encoding (cstring). Before this change, we used to convert >> encoding as >> > > Nitpick: "local**e** encoding". > > >> we handle each individual argument. But this approach has several >> limitations; >> >> 1) It's easy to miss some conversions since we need to constantly keep >> it in >> mind that the arguments are in non-UTF-8 encoding, unlike the majority of >> codebase assumes. Let's keep it consistent and simple. >> >> 2) Even though we could perfectly handle Non-ASCII (Unicode) strings in >> UTF-8 >> encoding under the hood, cstring squishes them to allow ASCII-only data. >> Switching command-line to UTF-8 would remove that layer of converting >> input >> data back and forth, which means direct conversions from UTF-16 to UTF-8 >> without any complications. >> >> Recently, we added a family of svn_cmdline__get_cstring_argv() function to >> resolve an issue with inconsistent encoding of what we assumed was >> cstring. >> Now, it's time to get into UTF-8, so we are introducing >> svn_cmdline__get_utf8_argv() for that. It uses >> svn_utf__win32_utf16_to_utf8() >> on Windows, and svn_utf_cstring_to_utf8() on Unix. >> >> This branch should not bring any behaviour changes on Unix, while Windows >> code >> now supports Unicode command-line arguments. >> >> * subversion/include/private/svn_cmdline_private.h >> (svn_cmdline__win32_get_utf8_argv, svn_cmdline__default_get_utf8_argv): >> Declare methods. >> (svn_cmdline__get_utf8_argv): New define. >> >> * subversion/include/svn_client.h >> (svn_client_args_to_target_array3): Declare new functions that do not >> perform UTF-8 conversion unlike its old variant. >> (svn_client_args_to_target_array2): Deprecate. >> >> * subversion/include/svn_opt.h >> (svn_opt_args_to_target_array4, svn_opt_parse_revprop2): Declare new >> functions that do not perform UTF-8 conversion unlike their old >> variant. >> (svn_opt_args_to_target_array3, svn_opt_parse_revprop): Deprecate. >> >> * subversion/libsvn_client/client.h >> (svn_client__process_target_array): Declare new private function to put >> common code of target collection to later use in either UTF8 or >> non-UTF8 >> versions. >> >> * subversion/libsvn_client/cmdline.c >> (svn_client__process_target_array): Implement function, using body of >> former svn_client_args_to_target_array2() function. >> (svn_client_args_to_target_array3): Implement by forwarding targets to >> svn_client__process_target_array() as they are without any conversions. >> >> * subversion/libsvn_client/deprecated.c >> (svn_client_args_to_target_array2): Implement backward compatibility >> function in a similar way as svn_client_args_to_target_array3(), but >> also >> converting targets to UTF8 encoding. >> >> * subversion/libsvn_subr/cmdline.c >> (svn_cmdline__win32_get_utf8_argv): New function that receives a >> collection >> of argv in local Windows encoding and converts them to UTF8 using >> svn_utf__win32_utf16_to_utf8. >> (svn_cmdline__default_get_utf8_argv): New function that ensures the >> exact >> same behaviour in terms of encoding conversions performed as our >> command-line code used to do it before on Unix platforms since it is >> backed >> by svn_utf_cstring_to_utf8(). >> >> * subversion/libsvn_subr/deprecated.c >> (svn_opt_args_to_target_array3): Use svn_opt_parse_all_args() method to >> back >> the implementation. >> (svn_opt_parse_revprop): Implement wrapper over svn_opt_parse_revprop2 >> with >> UTF8 conversion. >> >> * subversion/libsvn_subr/opt.c >> (svn_opt__args_to_target_array): New function implementation. >> (svn_opt_parse_revprop2): Bump version and remove UTF8 conversion. >> >> * subversion/svn/auth-cmd.c >> * subversion/svn/propdel-cmd.c >> * subversion/svn/propedit-cmd.c >> * subversion/svn/propget-cmd.c >> * subversion/svn/propset-cmd.c >> * subversion/svn/shelf-cmd.c >> * subversion/svn/shelf2-cmd.c >> (svn_cl__auth, >> svn_cl__changelist, >> svn_cl__propdel, >> svn_cl__propedit, >> svn_cl__propget, >> svn_cl__propedit, >> svn_cl__propget, >> svn_cl__propset, >> get_next_argument): Remove conversion to UTF8 encoding of target >> arguments. >> >> * subversion/svn/svn.c >> (parse_compatible_version): Do not convert the encoding of value of the >> opt_compatible_version argument. >> >> (sub_main): Use svn_cmdline__get_utf8_argv instead of >> svn_cmdline__get_cstring_argv to get normalized arguments. >> (sub_main): Read opt_arg directly to utf8_opt_arg and remove all >> conversion >> from cstring to UTF8. >> (sub_main): For some options which want cstring, convert utf8_opt_arg >> back. >> Those include --diff-cmd, --merge-cmd, and --editor-cmd. >> (sub_main): Use a newer version of svn_opt_parse_revprop, >> svn_opt_parse_revprop >> which does not convert strings to UTF8. >> (sub_main): Do not utf8-ize first_arg. >> (sub_main): Do not convert the message to UTF8. >> >> * subversion/svn/util.c subversion/svn/util.c >> (svn_cl__make_log_msg_baton): Consider --encoding argument only when the >> message is read from a file, keeping it in UTF8 when --message is >> utilised >> instead. >> >> (svn_cl__args_to_target_array_print_reserved): Use >> svn_client_args_to_target_array3 instead of >> svn_client_args_to_target_array2, >> which doesn't convert result to utf8, since we are working with already >> converted 'os'. >> >> * subversion/svnadmin/svnadmin.c >> (parse_args): No need to command-line arguments to UTF8. >> (set_revprop): Explicitly state UTF8 encoding when invoking the >> svn_subst_translate_string2() since the arguments should already be in >> this >> encoding. This ensures that no double conversion would be done. >> (subcommand_lslocks): Use svn_opt_args_to_target_array4() instead of >> svn_opt_args_to_target_array3() to retrieve targets which does not >> convert >> encodings. >> (sub_main): Use svn_cmdline__get_utf8_argv() instead of >> svn_cmdline__get_cstring_argv() to get UTF8 argv. >> (sub_main): Read opt_arg directly to utf8_opt_arg and remove all >> conversion >> from cstring to UTF8. >> (sub_main): Do not utf8-ize first_arg. >> >> * subversion/svnbench/svnbench.c >> (sub_main): Convert args straight to UTF8, remove in-place conversions, >> and >> silently migrate to handle the other arguments as UTF8. >> >> * subversion/svnbench/null-list-cmd.c >> (includes): Remove svn_utf.h to make sure no more UTF8 conversions are >> done. >> >> * subversion/svnbench/util.c >> (svn_cl__args_to_target_array_print_reserved): Use newer version of >> svn_client_args_to_target_array(), which doesn't perform UTF8 >> conversion >> >> (sub_main): Directly convert argv to UTF8 and remove all following >> conversions. >> >> * subversion/svnlook/svnlook.c >> (print_diff_tree): Don't convert diff_options to UTF8 twice. >> (sub_main): Convert args straight to UTF8, remove in-place conversions, >> and >> handle the other arguments as UTF8. >> >> * subversion/svnfsfs/svnfsfs.c >> * subversion/svnmucc/svnmucc.c >> * subversion/svndumpfilter/svndumpfilter.c >> * subversion/svnrdump/svnrdump.c >> * subversion/svnserve/svnserve.c >> * subversion/svnversion/svnversion.c >> * tools/client-side/svn-mergeinfo-normalizer/svn-mergeinfo-normalizer.c >> * tools/client-side/svnconflict/svnconflict.c >> * tools/dev/svnraisetreeconflict/svnraisetreeconflict.c >> * tools/dev/wc-ng/svn-wc-db-tester.c >> * tools/server-side/svnauthz.c >> (includes): Remove svn_utf8.h where possible as we no longer use its >> functions. >> (sub_main): Convert all args straight to UTF8 and remove later >> conversion. >> >> * subversion/svnsync/svnsync.c >> (initialize_cmd, synchronize_cmd, copy_revprops_cmd, info_cmd): Use >> svn_opt_args_to_target_array4() instead of >> svn_opt_args_to_target_array3() >> and remove svn_utf_cstring_to_utf8() calls. >> >> (sub_main): Convert args straight to UTF8, remove in-place conversions, >> and >> handle the other arguments as UTF8. >> >> * subversion/tests/cmdline/svntest/testcase.py >> (_detect_utf8_locale): New local helper. >> (RequireUtf8_deco): Add RequireUtf8 decorator that allows test to be >> run in >> UTF8 environment to fix failures on Unix platforms. >> >> * subversion/tests/cmdline/basic_tests.py >> (): Import RequireUtf8 decorator from testcase.py >> (argv_with_best_fit_chars): >> Adapt the unit test to the arguments in utf-8 bytes. >> (unicode_arguments_test): Use that RequireUtf8 decorator to force the >> test to >> run in the UTF8 environment. >> >> > I am +1 to merge utf8-cmdline branch to trunk. > > > Thanks for the review! Completed the merge in r1935602. > > * subversion/tests/libsvn_subr/opt-test.c > > (includes): Add svn_hash.h. > > (test_svn_opt_parse_revprop): Add new test. > > > Probably it would be better to #undef UNICODE_TEST_STRING or just inline it. I decided not to change this part yet. But probably yeah but it doesn't really matter and we've done either way before. -- Timofei Zhakov

