On 08/12/17 19:15, Assaf Gordon wrote:
> Hello Mark,
>
> First,
> thank you for taking the time and effort
> to test our development snapshot, and reporting results back.
> This kind of feedback is critical in getting multibyte support ready.
>
>
> Second,
> I can confirm the behavior you are observing, reproduced here
> with 'od' for easier output:
>
> ## POSIX single-byte locale:
>
>   $ echo "ß" | LC_ALL=C src/fold --bytes --width 1 | od -tc -An
>   303 \n 237 \n
>   $ echo "ß" | LC_ALL=C src/fold --width 1 | od -tc -An
>   303 \n 237 \n
>
> ## UTF8 locale:
>
>   $ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold --bytes --width 1 | od -tc -An
>   303 237 \n
>
>   $ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold --width 1 | od -tc -An
>   303 237 \n
>
>
> On 2017-12-08 05:04 AM, Mark Roberts wrote:
>> When --bytes is not specified, the program treats '\b', '\r' and '\t'
>> specially. It assumes a tab width of eight (compile-time #define) and
>> attempts to keep track of what the output will look like.
>>
>> This is absolutely not what I expected.
>
> That is correct, and I share your sentiment: it also took me some time
> to try and track down why it behaves this way, and whether it's by
> design or a bug.
>
>> But of course, when the program
>> was first written, the words byte and character meant the same thing for
>> printable characters. Printable bytes.
>
> The reasoning for this behavior is explained in the OpenGroup's POSIX
> standard page for fold, in the "RATIONALE" section:
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/fold.html#tag_20_48_18
>
> There, it is made clear:
> "Historical versions of the fold utility assumed 1 byte was one
> character and occupied one column position when written out. This is
> no longer always true.
> [....]
> Note that although the width for the -b option is in bytes, a line is
> never split in the middle of a character."
>
> Therefore, the current implementation (of the development version) is
> correct.
>
>> I will attempt to suggest an improved text for the man-page so that
>> others will not be surprised.
>
> I agree that once multibyte support is added to fold(1), the man pages,
> the help screen and texi manual must be updated to clearly
> indicate the "-b/--bytes" only applies to \b \t \r and never to
> multibyte characters.
>
> If you find the time to send such a patch - great!
> If not, I will add it sooner or later (hopefully sooner).
>
> As such I'm closing this bug report, but further discussion (and
> patches) are welcomed by replying to this thread.

Note that while splitting in the middle of a character is incorrect,
that doesn't preclude approximate byte counting with --bytes.
This is the approach the current i18n patch takes:

  $ export LC_ALL=en_CA.UTF-8
  $ echo "ßß" | fold-i18n --bytes --width 1 | od -tc -An
  303 237 \n 303 237 \n \n
  $ echo "ßß" | fold-i18n --bytes --width 2 | od -tc -An
  303 237 \n 303 237 \n \n
  $ echo "ßß" | fold-assaf --bytes --width 2 | od -tc -An
  303 237 303 237 \n

The i18n version of fold also has a --characters option
to operate in the current fold-assaf mode.
I'm not convinced we want to differ from the i18n patch,
in this regard at least.

cheers,
Pádraig.
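
For illustration only, here is a minimal Python sketch of the two counting
strategies discussed above. It is not taken from either patch: the function
names fold_bytes_approx and fold_chars are hypothetical, and it only models
where the breaks fall for a single line of UTF-8 text, not the exact output
(including trailing newlines) of fold-i18n or fold-assaf.

```python
def fold_bytes_approx(text: str, width: int) -> list[str]:
    """Break text into pieces of at most `width` UTF-8 bytes,
    never cutting inside a character. A character that is itself
    wider than `width` is placed on a piece of its own."""
    pieces = []
    current = ""
    used = 0  # bytes already placed in the current piece
    for ch in text:
        size = len(ch.encode("utf-8"))
        # Start a new piece if this character would overflow the width,
        # but only when the current piece already holds something.
        if current and used + size > width:
            pieces.append(current)
            current, used = "", 0
        current += ch
        used += size
    if current:
        pieces.append(current)
    return pieces


def fold_chars(text: str, width: int) -> list[str]:
    """Break text every `width` characters, ignoring byte length."""
    return [text[i:i + width] for i in range(0, len(text), width)]


if __name__ == "__main__":
    print(fold_bytes_approx("ßß", 1))  # ['ß', 'ß']
    print(fold_bytes_approx("ßß", 2))  # ['ß', 'ß']
    print(fold_chars("ßß", 2))         # ['ßß']
```

With a width of 1 or 2 bytes, the byte-counting variant puts each two-byte
"ß" on its own line, whereas the character-counting variant keeps "ßß"
together at width 2, which mirrors the difference shown in the od output
above.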