Hello,

I'm looking into adding multibyte support to tr(1), and am interested in some feedback.

1. "-C" vs "-c"
---------------

The POSIX tr(1) page says:

    "-c  Complement the set of values specified by string1.
     -C  Complement the set of characters specified by string1."
    ( http://pubs.opengroup.org/onlinepubs/9699919799/utilities/tr.html )

This I take to mean:
"-c" is single bytes (=values) regardless of locale,
"-C" is multibyte characters, depending on locale.

First, is the above correct?

Second, assuming it is correct, is the following expected output correct?

The UTF-8 sequence '\316\243' is U+03A3 GREEK CAPITAL LETTER SIGMA 'Σ'.
The UTF-8 sequence '\316\250' is U+03A8 GREEK CAPITAL LETTER PSI 'Ψ'.

POSIX unibyte locale and lower-case "-c":

    printf '\316\243\316\250' | LC_ALL=C tr -dc '\316\250'
    => '\316\316\250'

UTF-8 locale but lower-case "-c", input set should be treated as
two separate single-byte octets:

    printf '\316\243\316\250' | LC_ALL=en_US.UTF-8 tr -dc '\316\250'
    => '\316\316\250'

POSIX unibyte locale and upper-case "-C", input set should be treated as
two separate single-byte octets:

    printf '\316\243\316\250' | LC_ALL=C tr -dC '\316\250'
    => '\316\316\250'

UTF-8 locale with upper-case "-C", input set is one multibyte character:

    printf '\316\243\316\250' | LC_ALL=en_US.UTF-8 tr -dC '\316\250'
    => '\316\250'


2. Invalid multibyte sequences in SET1/SET2 parameters
------------------------------------------------------

I assume that invalid multibyte sequences in the *input* file must be
output as-is (in accordance with other coreutils programs).

However, what about invalid sequences in the SET1/SET2 parameters?
Can we reject them (and fail/refuse to run)?

That is, in the POSIX locale, both of these are valid and mean the same
thing (delete two octet values):

    LC_ALL=C tr -d '\316\250'
    LC_ALL=C tr -d '\250\316'

But in a UTF-8 locale, should we accept the invalid sequence:

    LC_ALL=en_US.UTF8 tr -d '\250\316'

and treat it (silently) as two separate octets, or should we exit with
an error message (e.g. "SET1 is not valid in this locale")?


3. Backward incompatibility
---------------------------

Also related to the previous item, I think tr(1) might be a case where
adding multibyte support breaks existing scripts and is seen as a
regression by users.

If someone used commands like

    tr -d '\200-\377'
    tr -d '\316\250'

and these have worked for many years regardless of locale,
adding multibyte support might disrupt this.

What do you think? Perhaps this usage is not so common, and it won't be
too big of a disruption?

thanks for reading,
 - assaf
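
For reference, the two readings from item 1 and the validity check from item 2 can be modelled with a short Python sketch. This is only an illustration of the semantics asked about above, not how tr(1) is or would be implemented. The function names (delete_complement_bytes, delete_complement_chars, set_is_valid) are made up for this example, UTF-8 stands in for the locale's encoding, and passing undecodable input bytes through unchanged follows the assumption stated in item 2.

```python
def delete_complement_bytes(data: bytes, set1: bytes) -> bytes:
    """Model 'tr -dc SET1' with SET1 read as byte values.

    Every input byte that is not one of the bytes in SET1 is deleted.
    """
    keep = set(set1)
    return bytes(b for b in data if b in keep)


def _is_escaped_byte(ch: str) -> bool:
    # With errors="surrogateescape", each undecodable byte 0x80..0xFF is
    # decoded to a lone surrogate in the range U+DC80..U+DCFF.
    return 0xDC80 <= ord(ch) <= 0xDCFF


def delete_complement_chars(data: bytes, set1: bytes,
                            encoding: str = "utf-8") -> bytes:
    """Model 'tr -dC SET1' with SET1 read as multibyte characters.

    Every input character not in SET1 is deleted. Undecodable input
    bytes are kept as-is (an assumption of this sketch, taken from item 2).
    """
    keep = set(set1.decode(encoding, errors="surrogateescape"))
    text = data.decode(encoding, errors="surrogateescape")
    kept = "".join(ch for ch in text if ch in keep or _is_escaped_byte(ch))
    return kept.encode(encoding, errors="surrogateescape")


def set_is_valid(set1: bytes, encoding: str = "utf-8") -> bool:
    """Item 2: does SET1 decode cleanly in the given encoding?"""
    try:
        set1.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    data = b"\316\243\316\250"   # the bytes of U+03A3 followed by U+03A8
    set1 = b"\316\250"           # the bytes of U+03A8
    print(delete_complement_bytes(data, set1))
    print(delete_complement_chars(data, set1))
    print(set_is_valid(b"\316\250"), set_is_valid(b"\250\316"))
```

The byte-wise function corresponds to the three cases in item 1 that expect '\316\316\250', and the character-wise function to the UTF-8 "-C" case that expects '\316\250'.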
