Re: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...
On 12/11/23, Greg Wooledge wrote:
> 1) Many implementations of echo will interpret parts of their argument(s),
> in addition to processing options like -n. If you want to print a
> variable's contents to standard output without *any* interpretation,
> use printf.
>
> printf %s "$myvar"
> printf '%s\n' "$myvar"

I will use "printf ..." from now on.

> 2) As tomas already told you, the square brackets in
>
> tr -c -s '[A-Za-z0-9.]' _
>
> are literal. You're using a command which will keep left and right
> square brackets in the input, *not* replacing them with underscores.
> This may not be what you want.

My mistake, even though it didn't get in the way of what I was trying to do. I replaced :alnum: by what I thought it meant and left the brackets.

> 3) In locales other than C or POSIX, ranges like A-Z are *not* necessarily
> synonyms for [:upper:]. As I've already mentioned, GNU tr is known to
> contain bugs, so you're getting lucky here. The bugs in GNU tr happen
> to work the way you're expecting, so that A-Z is treated like [:upper:]
> when it should not be. If at some point in the future GNU tr is fixed
> to conform to POSIX, your script may break.
>
> The correct tr command you should be using if you want to retain
> accented letters (as defined in your locale) is:
>
> tr -c -s '[:alnum:].' _
>
> If you want to discard accented letters, then either of these is OK:
>
> LC_COLLATE=C tr -c -s '[:alnum:].' _
> LC_COLLATE=C tr -c -s 'A-Za-z0-9.' _

I like your second one-liner much better (LC_COLLATE=C tr -c -s 'A-Za-z0-9.' _). I tend to avoid '[:alnum:].' because the intended meaning of "ALphabetic et NUMeric" characters, even though it depends on the locale, has a strong ASCII accent to it.

> Thus, we come full circle.

Yes, we did. Thank you,
lbrtchx
Re: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...
Albretch Mueller wrote:
> echo "abc123" > file.txt
> ftype=$(file --brief file.txt)
> echo "// __ \$ftype: |${ftype}|"
> ftypelen=${#ftype}
> echo "// __ \$ftypelen: |${ftypelen}|"
>
> # removing spaces ...
> ftype2=$(echo "${ftype}" | tr --complement --squeeze-repeats
> '[A-Za-z0-9.]' '_');
> echo "// __ \$ftype2: |${ftype2}|"
> ftype2len=${#ftype2}
> echo "// __ \$ftype2len: |${ftype2len}|"
>
> lbrtchx

Short answer. tr doesn't append anything. echo does output a linefeed at the end of the string, unless you stop it. tr dutifully translates that to an underscore.
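A minimal sketch of that effect, assuming a POSIX shell, the bracket-free set 'A-Za-z0-9.' and LC_ALL=C so that the ranges mean plain ASCII. The variable names with_echo and with_printf are only illustrative and do not appear in the original script:

```sh
ftype='ASCII text'

# echo appends a newline; the newline is outside the set, so tr -c
# translates it to "_" just like the space in the middle.
with_echo=$(echo "$ftype" | LC_ALL=C tr -cs 'A-Za-z0-9.' '_')

# printf '%s' writes the string and nothing else, so nothing is added.
with_printf=$(printf '%s' "$ftype" | LC_ALL=C tr -cs 'A-Za-z0-9.' '_')

printf '%s\n' "$with_echo" "$with_printf"
```

This should print ASCII_text_ followed by ASCII_text: the only difference between the two pipelines is the trailing newline that echo feeds to tr.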
Re: "echo" literally in sh scripts (was: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...)
On Mon, Dec 11, 2023 at 10:16:35AM -0500, Stefan Monnier wrote:
> > 1) Many implementations of echo will interpret parts of their argument(s),
> > in addition to processing options like -n. If you want to print a
> > variable's contents to standard output without *any* interpretation,
> > use printf.
> >
> > printf %s "$myvar"
> > printf '%s\n' "$myvar"
>
> Interesting. I used the following instead:
>
> bugit_echo () {
> # POSIX `echo` has all kinds of "features" we don't want, such as
> # handling of \c and -n.
> cat <<ENDDOC
> $*
> ENDDOC
> }

That requires an external command (one fork), plus whatever overhead is used by the << implementation (temp file or pipe, depending on shell and version). It's not wrong, but an implementation using nothing but builtins is usually preferable.

    echo() { printf '%s\n' "$*"; }

It's also worth mentioning that both of these rely on the expansion of $* with a default or nearly-default IFS variable. If you want it to work when IFS may have been altered, you can do this in bash:

    echo() { local IFS=' '; printf '%s\n' "$*"; }

In sh, you'd need to fork a subshell:

    echo() { (IFS=' '; printf '%s\n' "$*"); }

Or if you're a golfer:

    echo() (IFS=' '; printf '%s\n' "$*")

I *really* dislike that syntax, but that's just me. Some people use it.
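To make the IFS caveat concrete, here is a small sketch for a POSIX shell. The function names join_args and join_args_safe are hypothetical, chosen so that the real echo is not overridden; they mirror the plain and the subshell variants above:

```sh
# Joins its arguments the way "$*" does: with the first character of IFS.
join_args() { printf '%s\n' "$*"; }

# Same, but IFS is pinned to a space inside a subshell (plain sh).
join_args_safe() { (IFS=' '; printf '%s\n' "$*"); }

# Alter IFS in a subshell so the change does not leak into the caller.
(
    IFS=:
    join_args a b c        # joined with ":" because IFS was altered
    join_args_safe a b c   # joined with a space regardless of the outer IFS
)
```

The first call prints a:b:c and the second prints a b c, which is why the unguarded version is only safe while IFS still starts with a space.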
"echo" literally in sh scripts (was: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...)
> 1) Many implementations of echo will interpret parts of their argument(s),
> in addition to processing options like -n. If you want to print a
> variable's contents to standard output without *any* interpretation,
> use printf.
>
> printf %s "$myvar"
> printf '%s\n' "$myvar"

Interesting. I used the following instead:

bugit_echo () {
# POSIX `echo` has all kinds of "features" we don't want, such as
# handling of \c and -n.
cat <<ENDDOC
$*
ENDDOC
}
Re: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...
On Mon, Dec 11, 2023 at 09:55:54AM -0500, Greg Wooledge wrote:

[...]

Greg, your analyses are always impressive. And enjoyable.

Thanks for this

cheers
--
t
Re: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...
On Mon, Dec 11, 2023 at 02:00:49PM +, Albretch Mueller wrote:
> Ach, yes! I forgot echo by default appends a new line character at
> the end of every string it spits out. In order to suppress it you need
> to use the "n" option: "echo -n ..."
>
> _FL_TYPE=" abc á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdiehere ¿ ¡ §
> ASCII ä ö ü ß Ä Ö Ü Text"
> echo "// __ \$_FL_TYPE: |${_FL_TYPE}|"
> _FL_TYPE=$(echo "${_FL_TYPE}" | xargs)
> echo "// __ \$_FL_TYPE: |${_FL_TYPE}|"
> _FL_TYPE=$(echo -n "${_FL_TYPE}" | tr --complement --squeeze-repeats
> '[A-Za-z0-9.]' '_');
> echo "// __ \$_FL_TYPE: |${_FL_TYPE}|"
>
> // __ $_FL_TYPE: | abc á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdiehere
> ¿ ¡ § ASCII ä ö ü ß Ä Ö Ü Text|
> // __ $_FL_TYPE: |abc á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdiehere ¿ ¡
> § ASCII ä ö ü ß Ä Ö Ü Text|
> // __ $_FL_TYPE: |abc_123_birdie_here_ASCII_Text|

OK. Tomas's analysis was better than mine in this case. Looks like CR was not the issue this time around.

I do have some comments, though.

1) Many implementations of echo will interpret parts of their argument(s), in addition to processing options like -n. If you want to print a variable's contents to standard output without *any* interpretation, use printf.

    printf %s "$myvar"
    printf '%s\n' "$myvar"

2) As tomas already told you, the square brackets in

    tr -c -s '[A-Za-z0-9.]' _

are literal. You're using a command which will keep left and right square brackets in the input, *not* replacing them with underscores. This may not be what you want.

3) In locales other than C or POSIX, ranges like A-Z are *not* necessarily synonyms for [:upper:]. As I've already mentioned, GNU tr is known to contain bugs, so you're getting lucky here. The bugs in GNU tr happen to work the way you're expecting, so that A-Z is treated like [:upper:] when it should not be. If at some point in the future GNU tr is fixed to conform to POSIX, your script may break.

The correct tr command you should be using if you want to retain accented letters (as defined in your locale) is:

    tr -c -s '[:alnum:].' _

If you want to discard accented letters, then either of these is OK:

    LC_COLLATE=C tr -c -s '[:alnum:].' _
    LC_COLLATE=C tr -c -s 'A-Za-z0-9.' _

4) The xargs command, which you used above, uses quotation mark characters as well as whitespace to define input words. Your example worked only because your input does not contain any single or double quotes.

Here's a demonstration of A-Z not equating to [:upper:] using GNU sed, which is behaving correctly:

    unicorn:~$ x=' abc á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdiehere ¿ '
    unicorn:~$ printf '%s\n' "$x" | sed 's/[A-Z]//g'
    abc á é í ó ú ü ñ123 birdiehere ¿
    unicorn:~$ printf '%s\n' "$x" | LC_COLLATE=C sed 's/[A-Z]//g'
    abc á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdiehere ¿

The meaning of [A-Z] in the sed command depends on the locale. In my locale, which is en_US.utf8, characters like Á are part of the A-Z range. In the C locale, they aren't, as seen in the last command above.

The use of [A-Z] in regular expressions and globs is a *very* heavily debated topic, and I'm only scratching the surface here. Honestly, you really should avoid using it. It's just too unpredictable.

Here's an example of xargs failing when its input contains a quote:

    unicorn:~$ echo 'foo "bar' | xargs
    xargs: unmatched double quote; by default quotes are special to xargs unless you use the -0 option
    foo

You can't use xargs to normalize whitespace safely. In fact, the proper way to normalize whitespace is...

    unicorn:~$ printf 'foo "bar \t\t \t baz \n' | tr -s ' \t' ' '
    foo "bar baz

Thus, we come full circle.
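The tr approach can be wrapped in a small helper. This is only a sketch: normalize_ws is a hypothetical name, the leading/trailing trim is an addition that the tr command above does not perform, and it assumes (as GNU tr does) that a shorter second set is padded with its last character:

```sh
# Squeeze runs of spaces and tabs to a single space, then drop the one
# leading and one trailing space that may remain after squeezing.
normalize_ws() {
    s=$(printf '%s' "$1" | tr -s ' \t' ' ')
    s=${s# }
    s=${s% }
    printf '%s\n' "$s"
}

# A double quote in the input is harmless here, unlike with xargs.
normalize_ws '  foo   "bar   baz  '
```

Because tr never parses its input as words, quotes and backslashes pass through untouched, which is the property xargs lacks.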
Re: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...
On 11/12/2023 21:00, Albretch Mueller wrote:
> // __ $_FL_TYPE: |abc á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdiehere ¿ ¡ § ASCII ä ö ü ß Ä Ö Ü Text|
> // __ $_FL_TYPE:|abc_123_birdie_here_ASCII_Text|

https://pypi.org/project/Unidecode/ should be more friendly to languages other than English.
Re: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...
Ach, yes! I forgot echo by default appends a new line character at the end of every string it spits out. In order to suppress it you need to use the "n" option: "echo -n ..."

    _FL_TYPE=" abc á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdiehere ¿ ¡ § ASCII ä ö ü ß Ä Ö Ü Text"
    echo "// __ \$_FL_TYPE: |${_FL_TYPE}|"
    _FL_TYPE=$(echo "${_FL_TYPE}" | xargs)
    echo "// __ \$_FL_TYPE: |${_FL_TYPE}|"
    _FL_TYPE=$(echo -n "${_FL_TYPE}" | tr --complement --squeeze-repeats '[A-Za-z0-9.]' '_');
    echo "// __ \$_FL_TYPE: |${_FL_TYPE}|"

    // __ $_FL_TYPE: | abc á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdiehere ¿ ¡ § ASCII ä ö ü ß Ä Ö Ü Text|
    // __ $_FL_TYPE: |abc á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdiehere ¿ ¡ § ASCII ä ö ü ß Ä Ö Ü Text|
    // __ $_FL_TYPE: |abc_123_birdie_here_ASCII_Text|

Thank you,
lbrtchx
Re: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...
On Mon, Dec 11, 2023 at 02:11:46PM +0100, to...@tuxteam.de wrote:
> On Mon, Dec 11, 2023 at 07:42:10AM -0500, Greg Wooledge wrote:
> > Looks like GNU tr in Debian 12 still doesn't handle multibyte characters
> > correctly:
> >
> > unicorn:~$ echo 'mañana' | tr ñ X
> > maXXana
>
> Hey, you just gave us a handy way to count how many encoding units
> a character takes:
>
> tomas@trotzki:~$ echo 'birdiehere' | tr -c 'a-z' X
> birdiehereX

Cute as that is, there are better ways.

    unicorn:~$ x=ñ; (echo "${#x}"; LC_ALL=C; echo "${#x}")
    1
    2
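If only the byte count is wanted, a locale-independent alternative is to count what actually goes down the pipe. A sketch, with byte_len as a hypothetical helper name; the octal escapes spell out the two UTF-8 bytes of "ñ" so the example does not depend on how this text itself is encoded:

```sh
# Number of bytes in a string, whatever the current locale says about
# characters. tr -d ' ' removes the padding some wc implementations add.
byte_len() { printf '%s' "$1" | wc -c | tr -d ' '; }

x=$(printf '\303\261')   # UTF-8 encoding of "ñ"
byte_len "$x"
byte_len 'manana'
```

This prints 2 and then 6. Unlike ${#x}, the result does not change with LC_ALL, because wc -c counts bytes by definition.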
Re: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...
On Mon, Dec 11, 2023 at 07:42:10AM -0500, Greg Wooledge wrote:
> On Mon, Dec 11, 2023 at 09:37:42AM +0100, to...@tuxteam.de wrote:
> > 2. This is tr, not regexp, so '[A-Za-z0-9.]' isn't doing what you
> > think it does. It will match '[', 'A' to 'Z', 'a' to 'z','.' and
> > ']'. I guess you want to say 'A-Za-z0-9.'
>
> Well spotted.
>
> > 3. As a convenience, tr has char classes. Perhaps [:alnum:] is for
> > you. No idea whether this is a GNU extension
>
> It's POSIX. 100% portable, as long as you ignore any bugs in GNU tr.
>
> Looks like GNU tr in Debian 12 still doesn't handle multibyte characters
> correctly:
>
> unicorn:~$ echo 'mañana' | tr ñ X
> maXXana
>
> So... as long as you're working in the C locale, where [:alnum:] is
> just the ASCII capital and lowercase letters and digits, you should be
> fine.

Hey, you just gave us a handy way to count how many encoding units a character takes:

    tomas@trotzki:~$ echo 'birdiehere' | tr -c 'a-z' X
    birdiehereX

;-)

Cheers
--
t
Re: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...
On Mon, Dec 11, 2023 at 09:37:42AM +0100, to...@tuxteam.de wrote:
> 2. This is tr, not regexp, so '[A-Za-z0-9.]' isn't doing what you
> think it does. It will match '[', 'A' to 'Z', 'a' to 'z','.' and
> ']'. I guess you want to say 'A-Za-z0-9.'

Well spotted.

> 3. As a convenience, tr has char classes. Perhaps [:alnum:] is for
> you. No idea whether this is a GNU extension

It's POSIX. 100% portable, as long as you ignore any bugs in GNU tr.

Looks like GNU tr in Debian 12 still doesn't handle multibyte characters correctly:

    unicorn:~$ echo 'mañana' | tr ñ X
    maXXana

So... as long as you're working in the C locale, where [:alnum:] is just the ASCII capital and lowercase letters and digits, you should be fine.
Re: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...
On Mon, Dec 11, 2023 at 11:25:13AM +, Albretch Mueller wrote:
> In the case of: "ASCII text"
> what should come out of it is: "ASCII_text"
> not: "ASCII_text_"
> no underscore at the end. That is the question I have.

OK, here's my guess. The lines of code that you showed us are not actually in a script. They're just in a FILE, and you're running a command like this:

    sh myfile

Furthermore, I am guessing that the lines of code in this file have Microsoft CR+LF line endings. Therefore, when you do a variable assignment like

    ftype=$(file --brief "$whatever")

you end up with a Carriage Return character at the end of the variable's content (because there is one at the end of this command).

Since you never actually SHOWED US the command you ran, or the output that was produced, which could have made this really, really obvious, we're forced to guess. My guess might be right, or wrong. But it's the best guess I have with the limited information you've chosen to share with us.

What I mean by "obvious" is this. Here's part of your code:

    echo "abc123" > file.txt
    ftype=$(file --brief file.txt)
    echo "// __ \$ftype: |${ftype}|"

If my guess is correct, you got output that looks like this:

    |/ __ $ftype: |ASCII text

Showing this would have made it immediately clear that a CR is involved.
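One way to test such a guess is to make the invisible byte visible. A sketch for a POSIX shell, where the CR is appended by hand to simulate what a CR+LF script line would leave behind; the variable names cr and cleaned are illustrative only:

```sh
cr=$(printf '\r')
ftype="ASCII text$cr"      # simulated value with a stray carriage return

# The length gives it away: 11 characters instead of the visible 10.
printf '%s\n' "${#ftype}"

# od -c shows the byte explicitly as \r.
printf '%s' "$ftype" | od -c

# tr -d '\r' strips it again.
cleaned=$(printf '%s' "$ftype" | tr -d '\r')
printf '%s\n' "${#cleaned}"
```

Command substitution only strips trailing newlines, not carriage returns, which is why the CR survives inside the variable in the first place.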
Re: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...
On Mon, Dec 11, 2023 at 11:25:13AM +, Albretch Mueller wrote:
> "tr --complement --squeeze-repeats ..." makes sure that the replaced
> characters only appear once (that it doesn't immediately repeat). Say
> you have something like " " (two spaces) or "?$|" (three characters)
> which will be replaced by just an underscore.

...which would change the length, as I wrote.

> In the case of: "ASCII text"
> what should come out of it is: "ASCII_text"
> not: "ASCII_text_"
> no underscore at the end. That is the question I have.

That depends on whether your "ASCII text" has some thingy at the end which you don't see. A newline, perchance?

> I use such constructs as: "[A-Za-z0-9.]" to make explicit to myself
> and other people what I mean. I work in corpora research dealing with
> text based various alphabets not just in ASCII so I avoid any kinds of
> linguistic/cultural shortcuts and abbreviations.

What has this to do with how tr works? It will treat [ and ] as characters not to substitute. I pointed that out, because it might have been unintended:

    echo -n 'This is a text with [some brackets] in it' | tr -cs "[A-Za-z0-9.]" "_"
    This_is_a_text_with_[some_brackets]_in_it

(Note this "-n" on the echo, btw? Without it, I'd be getting a "_" at the end, the transliterated newline).

Do whatever you want :-)

Cheers
--
t
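For a side-by-side view of the two sets, here is a short sketch, assuming LC_ALL=C so the ranges are plain ASCII and using printf instead of echo -n; the variable names kept and replaced are illustrative only:

```sh
s='text with [some brackets] in it'

# Brackets inside the set are ordinary members, so they are kept.
kept=$(printf '%s' "$s" | LC_ALL=C tr -cs '[A-Za-z0-9.]' '_')

# Without them, "[" and "]" fall into the complement and become "_".
replaced=$(printf '%s' "$s" | LC_ALL=C tr -cs 'A-Za-z0-9.' '_')

printf '%s\n' "$kept" "$replaced"
```

The first line printed is text_with_[some_brackets]_in_it and the second is text_with_some_brackets_in_it; in the second case " [" and "] " each collapse to a single underscore because of -s.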
Re: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...
"tr --complement --squeeze-repeats ..." makes sure that the replaced characters only appear once (that it doesn't immediately repeat). Say you have something like " " (two spaces) or "?$|" (three characters) which will be replaced by just an underscore. In the case of: "ASCII text" what should come out of it is: "ASCII_text" not: "ASCII_text_" no underscore at the end. That is the question I have. I use such constructs as: "[A-Za-z0-9.]" to make explicit to myself and other people what I mean. I work in corpora research dealing with text based various alphabets not just in ASCII so I avoid any kinds of linguistic/cultural shortcuts and abbreviations. lbrtchx On 12/11/23, to...@tuxteam.de wrote: > On Mon, Dec 11, 2023 at 08:04:06AM +, Albretch Mueller wrote: >> On 12/11/23, Greg Wooledge wrote: >> > Please tell us ... >> >> OK, here is what I did as a t-table > > [...] > > Your style is confusing, to say the least. Why not play with minimal > examples and work your way up from that? > >> the two strings are not the same length even though your are just >> replacing ASCII characters, why did: >> echo "${ftype}" | tr --complement --squeeze-repeats '[A-Za-z0-9.]' '_' >> place a character at the end? > > Two things stick out: > > 1. with --squeeze-repeats you are challenging tr to output less >characters than the input has: > >trotzki:~$ echo -n "this is a # string ###" | tr -cs 'a-z' '_' >=> this_is_a_string_ > >(I allowed myself to simplify things a bit) See? tr is squeezing >repeats (repeated matches), the space-plus-three-hashes at the >end gets squeezed to just one _, thus changing the length. >If your strings contain more than one non-alphanumeric (something >I don't feel like even trying a guess at), this is bound to happen. >You ordered it. > > 2. This is tr, not regexp, so '[A-Za-z0-9.]' isn't doing what you >think it does. It will match '[', 'A' to 'Z', 'a' to 'z','.' and >']'. I guess you want to say 'A-Za-z0-9.' > > 3. As a convenience, tr has char classes. 
Perhaps [:alnum:] is for >you. No idea whether this is a GNU extension > > 4. In case of doubt, read the man page :) > > Cheers > -- > t >
Re: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...
On Mon, Dec 11, 2023 at 08:04:06AM +, Albretch Mueller wrote:
> On 12/11/23, Greg Wooledge wrote:
> > Please tell us ...
>
> OK, here is what I did as a t-table

[...]

Your style is confusing, to say the least. Why not play with minimal examples and work your way up from that?

> the two strings are not the same length even though you are just
> replacing ASCII characters, why did:
> echo "${ftype}" | tr --complement --squeeze-repeats '[A-Za-z0-9.]' '_'
> place a character at the end?

Two things stick out:

1. with --squeeze-repeats you are challenging tr to output less characters than the input has:

    trotzki:~$ echo -n "this is a # string ###" | tr -cs 'a-z' '_'
    => this_is_a_string_

   (I allowed myself to simplify things a bit) See? tr is squeezing repeats (repeated matches), the space-plus-three-hashes at the end gets squeezed to just one _, thus changing the length. If your strings contain more than one non-alphanumeric (something I don't feel like even trying a guess at), this is bound to happen. You ordered it.

2. This is tr, not regexp, so '[A-Za-z0-9.]' isn't doing what you think it does. It will match '[', 'A' to 'Z', 'a' to 'z','.' and ']'. I guess you want to say 'A-Za-z0-9.'

3. As a convenience, tr has char classes. Perhaps [:alnum:] is for you. No idea whether this is a GNU extension

4. In case of doubt, read the man page :)

Cheers
--
t
Re: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...
On 12/11/23, Greg Wooledge wrote:
> Please tell us ...

OK, here is what I did as a t-table

    echo "abc123" > file.txt
    # obvious text file
    ftype=$(file --brief file.txt)
    # got its type as reported by the "file" utility
    echo "// __ \$ftype: |${ftype}|"
    ftypelen=${#ftype}
    # length of the string containing the file type
    echo "// __ \$ftypelen: |${ftypelen}|"

    # removing spaces and any other char which is not '[A-Za-z0-9.]' replacing with underscores ...
    # here is what I think to be an error happened instead of just replacing ... by underscores
    # it adds an underscore at the end?
    ftype2=$(echo "${ftype}" | tr --complement --squeeze-repeats '[A-Za-z0-9.]' '_');
    echo "// __ \$ftype2: |${ftype2}|"
    ftype2len=${#ftype2}
    echo "// __ \$ftype2len: |${ftype2len}|"

the two strings are not the same length even though you are just replacing ASCII characters, why did:

    echo "${ftype}" | tr --complement --squeeze-repeats '[A-Za-z0-9.]' '_'

place a character at the end?

Probably echo and tr are not dancing well together. echo might be tailgating an end of string character which tr then replaces with an underscore.

Which option do I use with echo for that not to happen? Should I probably play with IFS ...?

lbrtchx

On 12/11/23, Greg Wooledge wrote:
> On Mon, Dec 11, 2023 at 02:53:07AM +, Albretch Mueller wrote:
>> echo "abc123" > file.txt
>> ftype=$(file --brief file.txt)
>> echo "// __ \$ftype: |${ftype}|"
>> ftypelen=${#ftype}
>> echo "// __ \$ftypelen: |${ftypelen}|"
>>
>> # removing spaces ...
>> ftype2=$(echo "${ftype}" | tr --complement --squeeze-repeats
>> '[A-Za-z0-9.]' '_');
>> echo "// __ \$ftype2: |${ftype2}|"
>> ftype2len=${#ftype2}
>> echo "// __ \$ftype2len: |${ftype2len}|"
>
> Please tell us:
>
> * What you are trying to do.
>
> * What you did (is the previous code all in a script? if so, this is a
> good answer for this part).
>
> * What result you got.
>
> * What you expected to get.
Re: why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...
On Mon, Dec 11, 2023 at 02:53:07AM +, Albretch Mueller wrote:
> echo "abc123" > file.txt
> ftype=$(file --brief file.txt)
> echo "// __ \$ftype: |${ftype}|"
> ftypelen=${#ftype}
> echo "// __ \$ftypelen: |${ftypelen}|"
>
> # removing spaces ...
> ftype2=$(echo "${ftype}" | tr --complement --squeeze-repeats
> '[A-Za-z0-9.]' '_');
> echo "// __ \$ftype2: |${ftype2}|"
> ftype2len=${#ftype2}
> echo "// __ \$ftype2len: |${ftype2len}|"

Please tell us:

* What you are trying to do.

* What you did (is the previous code all in a script? if so, this is a good answer for this part).

* What result you got.

* What you expected to get.
why would "tr --complement --squeeze-repeats ..." append the substitution char once more? ...
    echo "abc123" > file.txt
    ftype=$(file --brief file.txt)
    echo "// __ \$ftype: |${ftype}|"
    ftypelen=${#ftype}
    echo "// __ \$ftypelen: |${ftypelen}|"

    # removing spaces ...
    ftype2=$(echo "${ftype}" | tr --complement --squeeze-repeats '[A-Za-z0-9.]' '_');
    echo "// __ \$ftype2: |${ftype2}|"
    ftype2len=${#ftype2}
    echo "// __ \$ftype2len: |${ftype2len}|"

lbrtchx