Re: [Rd] gsub() hex character range problems in R-devel?

Martin Morgan Thu, 06 Jan 2022 08:47:57 -0800

Thanks Tomas and 'Brodie' for your expert explanation; it provides great help 
in understanding and solving my immediate problem.


Tomas' observation to 'do something like e.g. "only keep ASCII digits, ASCII 
space, ASCII underscore, but remove all other characters"' points to a basic 
weakness in the code I'm looking at. E.g., removing non-breaking space is 
probably not appropriate ('foo\ua0bar' should probably be cleaned to 'foo bar', 
not 'foobar'). More generally, other non-ASCII characters ('fancy' quotes, 
em-dashes, ...) would require special treatment. It seems like the right thing 
to do is to handle the raw data in its original encoding, rather than to try to 
clean it to ASCII.
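
A minimal sketch of the non-breaking-space case, assuming the input is a 
UTF-8 string and that mapping U+00A0 to an ordinary space is the desired 
cleanup (the variable names are hypothetical):

```r
# Hypothetical input containing U+00A0 (no-break space).
# The \u escape makes R flag the string as UTF-8.
x <- "foo\ua0bar"

# Replace the no-break space with an ASCII space instead of deleting it.
# fixed = TRUE treats the pattern as a literal string, not a regular expression.
cleaned <- gsub("\ua0", " ", x, fixed = TRUE)
print(cleaned)
```

Other characters such as 'fancy' quotes or em-dashes would each need their 
own explicit mapping in the same style.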

Martin

On 1/5/22, 4:17 AM, "Tomas Kalibera" <tomas.kalib...@gmail.com> wrote:

    Hi Martin,

    I'd add a few comments to the excellent analysis of Brodie.

    - \xhh is allowed and defined in Perl regular expressions, see ?regex 
    (would need perl=TRUE), but to enter that in an R string, you need to 
    escape the backslash.

    - \xhh is not defined by POSIX for extended regular expressions, nor is 
    it documented in ?regex for those; TRE supports it, but portable 
    programs should still not rely on that

    - a literal \xhh in an R string is turned into a byte by R, but I would 
    say this should not be used at all by users, because the result is 
    encoding specific

    - use of \u and \U in an R string is fine, it has well defined semantics 
    and the corresponding string will then be flagged UTF-8 in R (so e.g. 
    \ua0 is fine to represent the Unicode no-break space)

    - see the caveats about using character ranges with POSIX extended 
    regular expressions in ?regex re encodings; using Perl regular 
    expressions in UTF-8 mode is more reliable for those

    So, a variant of your example might be:

     > gsub("[\\x7f-\\xff]", "", "fo\ua0o", perl=TRUE)
    [1] "foo"

    (note that the \ua0 ensures that the text is UTF-8, and hence the UTF-8 
    mode for regular expressions is used, ?regex has more)
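
The points about \u escapes and escaped backslashes can be checked with a 
short sketch; the variable names are hypothetical, and only base R functions 
(Encoding, charToRaw, nchar, gsub) are used:

```r
# A \u escape yields a string that R flags as UTF-8.
x <- "fo\ua0o"
enc <- Encoding(x)
bytes <- charToRaw(x)   # U+00A0 is stored as the two bytes c2 a0

# With doubled backslashes, the pattern reaches the regex engine as text:
# the backslash, 'x', and hex digits are passed through for PCRE to interpret.
pattern <- "[\\x7f-\\xff]"
pattern_chars <- nchar(pattern)

# Perl-style matching on the UTF-8 input, as in the example above.
result <- gsub(pattern, "", x, perl = TRUE)
print(list(enc = enc, bytes = bytes, pattern_chars = pattern_chars, result = result))
```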

    However, I think it is better to formulate regular expressions to cover 
    all of Unicode, so do something like e.g. "only keep ASCII digits, ASCII 
    space, ASCII underscore, but remove all other characters".
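
A minimal sketch of that "keep list" formulation, assuming the rule is taken 
literally (only ASCII digits, space, and underscore survive); the input 
strings are made up for illustration:

```r
# Hypothetical inputs with a no-break space (U+00A0) and an em-dash (U+2014).
x <- c("fo\ua0o_1 2", "em\u2014dash 42")

# Negated character class: remove every character that is NOT an ASCII digit,
# an ASCII space, or an underscore. perl = TRUE is used so that UTF-8 input
# is matched character by character rather than byte by byte.
kept <- gsub("[^0-9 _]", "", x, perl = TRUE)
print(kept)
```

Because the class is negated, no range over non-ASCII characters has to be 
written at all, which sidesteps the encoding-specific range problem.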

    Best
    Tomas

    On 1/4/22 8:35 PM, Martin Morgan wrote:

    > I'm not very good at character encoding / etc so this might be user
    > error. The following code is meant to replace extended ASCII characters, in
    > particular a non-breaking space, with "", and it works in R-4-1-branch
    >
    >> R.version.string
    > [1] "R version 4.1.2 Patched (2022-01-04 r81445)"
    >> gsub("[\x7f-\xff]", "", "fo\xa0o")
    > [1] "foo"
    >
    > but fails in R-devel
    >
    >> R.version.string
    > [1] "R Under development (unstable) (2022-01-04 r81445)"
    >> gsub("[\x7f-\xff]", "", "fo\xa0o")
    > Error in gsub("[\177-\xff]", "", "fo\xa0o") : invalid regular expression
    > '[-�]', reason 'Invalid character range'
    > In addition: Warning message:
    > In gsub("[\177-\xff]", "", "fo\xa0o") :
    >    TRE pattern compilation error 'Invalid character range'
    >
    > There are other oddities, too, like
    >
    >> gsub("[[:alnum:]]", "", "fo\xa0o")  # R-4-1-branch
    > [1] "\xfc\xbe\x8c\x86\x84\xbc"
    >
    >> gsub("[[:alnum:]]", "", "fo\xa0o")  # R-devel
    > [1] "<>"
    >
    > The R-devel sessionInfo is
    >
    >> sessionInfo()
    > R Under development (unstable) (2022-01-04 r81445)
    > Platform: x86_64-apple-darwin19.6.0 (64-bit)
    > Running under: macOS Catalina 10.15.7
    >
    > Matrix products: default
    > BLAS:   /Users/ma38727/bin/R-devel/lib/libRblas.dylib
    > LAPACK: /Users/ma38727/bin/R-devel/lib/libRlapack.dylib
    >
    > locale:
    > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
    >
    > attached base packages:
    > [1] stats     graphics  grDevices utils     datasets  methods   base
    >
    > loaded via a namespace (and not attached):
    > [1] compiler_4.2.0
    >
    > (I have built my own R on macOS; similar behavior is observed on a Linux
    > machine)
    >
    > Any hints welcome,
    >
    > Martin Morgan
    > ______________________________________________
    > R-devel@r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel