Re: [racket-users] Are Regular Expression classes Unicode aware?

2020-07-11 Thread Ryan Culpepper
Great, I'm glad it was useful!

Ryan


On Sat, Jul 11, 2020 at 12:27 PM Peter W A Wood 
wrote:

> Dear Ryan
>
> Thank you for both your full, complete and understandable explanation and
> a working solution which is more than sufficient for my needs.
>
> I created a very simple function based on the regexp that you suggested
> and tested it against a number of cases:
>
>
> #lang racket
> (require test-engine/racket-tests)
>
> (check-expect (alpha? "") #f)                 ; empty string
> (check-expect (alpha? "1") #f)
> (check-expect (alpha? "a") #t)
> (check-expect (alpha? "hello") #t)
> (check-expect (alpha? "h1llo") #f)
> (check-expect (alpha? "\u00E7c\u0327") #t)    ; çç
> (check-expect (alpha? "noe\u0308l") #t)       ; noël
> (check-expect (alpha? "\U01D122") #f)         ; 𝄢 (bass clef)
> (check-expect (alpha? "\u216B") #f)           ; Ⅻ (Roman numeral)
> (check-expect (alpha? "\u0BEB") #f)           ; ௫ (5 in Tamil)
> (check-expect (alpha? "二の句") #t)            ; Japanese word "ninoku"
> (check-expect (alpha? "مدينة") #t)            ; Arabic word "madina"
> (check-expect (alpha? "٥") #f)                ; Arabic number 5
> (check-expect (alpha? "\u0628\uFCF2") #t)     ; Arabic letter beh with shaddah
> (define (alpha? s)
>  (regexp-match? #px"^\\p{L}+$" (string-normalize-nfc s)))
> (test)
>
> I suspect that there are some cases with scripts requiring multiple code
> points to render a single character such as Arabic with pronunciation marks
> e.g. دُ نْيَا. At the moment, I don’t have the time (or need) to
> investigate further.
>
> The depth of Racket’s Unicode support is impressive.
>
> Once again, thanks.
>
> Peter
>
>
> > On 10 Jul 2020, at 15:47, Ryan Culpepper  wrote:
> >
> > (I see this went off the mailing list. If you reply, please consider
> CCing the list.)
> >
> > Yes, I understood your goal of trying to capture the notion of Unicode
> "alphabetic" characters with a regular expression.
> >
> > As far as I know, Unicode doesn't have a notion of "alphabetic", but it
> does assign every code point to a "General category", consisting of a main
> category and a subcategory. There is a category called "Letter", which
> seems like one reasonable generalization of "alphabetic".
> >
> > In Racket, you can get the code for a character's category using
> `char-general-category`. For example:
> >
> >   > (char-general-category #\A)
> >   'lu
> >   > (char-general-category #\é)
> >   'll
> >   > (char-general-category #\ß)
> >   'll
> >   > (char-general-category #\7)
> >   'nd
> >
> > The general category for "A" is "Letter, uppercase", which has the code
> "Lu", which Racket turns into the symbol 'lu. The general category of "é"
> is "Letter, lowercase", code "Ll", which becomes 'll. The general category
> of "7" is "Number, decimal digit", code "Nd".
> >
> > In Racket regular expressions, the \p{category} syntax lets you
> recognize characters from a specific category. For example, \p{Lu}
> recognizes characters with the category "Letter, uppercase", and \p{L}
> recognizes characters with the category "Letter", which is the union of
> "Letter, uppercase", "Letter, lowercase", and so on.
> >
> > So the regular expression #px"^\\p{L}+$" recognizes sequences of one or
> more Unicode letters. For example:
> >
> >   > (regexp-match? #px"^\\p{L}+$" "héllo")
> >   #t
> >   > (regexp-match? #px"^\\p{L}+$" "straße")
> >   #t
> >   > (regexp-match? #px"^\\p{L}+$" "二の句")
> >   #t
> >   > (regexp-match? #px"^\\p{L}+$" "abc123")
> >   #f ;; No, contains numbers
> >
> > There are still some problems to watch out for, though. For example,
> accented characters like "é" can be expressed as a single pre-composed code
> point or "decomposed" into a base letter and a combining mark. You can get
> the decomposed form by converting the string to "decomposed normal form"
> (NFD), and the regexp above won't match that string.
> >
> >   > (map char-general-category (string->list (string-normalize-nfd "é")))
> >   '(ll mn)
> >   > (regexp-match? #px"^\\p{L}+$" (string-normalize-nfd "héllo"))
> >   #f
> > 
> > One fix would be to call `string-normalize-nfc` first, but some
> letter-modifier pairs don't have pre-composed versions. Another fix would
> be to expand the regexp to include modifiers. You'd have to decide which is
> better based on your application.
> >
> > Ryan
> >
>


Re: [racket-users] Are Regular Expression classes Unicode aware?

2020-07-11 Thread Peter W A Wood
Dear Ryan

Thank you for both your full, complete and understandable explanation and a 
working solution which is more than sufficient for my needs.

I created a very simple function based on the regexp that you suggested and 
tested it against a number of cases:


#lang racket
(require test-engine/racket-tests)

(check-expect (alpha? "") #f)                 ; empty string
(check-expect (alpha? "1") #f)
(check-expect (alpha? "a") #t)
(check-expect (alpha? "hello") #t)
(check-expect (alpha? "h1llo") #f)
(check-expect (alpha? "\u00E7c\u0327") #t)    ; çç
(check-expect (alpha? "noe\u0308l") #t)       ; noël
(check-expect (alpha? "\U01D122") #f)         ; 𝄢 (bass clef)
(check-expect (alpha? "\u216B") #f)           ; Ⅻ (Roman numeral)
(check-expect (alpha? "\u0BEB") #f)           ; ௫ (5 in Tamil)
(check-expect (alpha? "二の句") #t)            ; Japanese word "ninoku"
(check-expect (alpha? "مدينة") #t)            ; Arabic word "madina"
(check-expect (alpha? "٥") #f)                ; Arabic number 5
(check-expect (alpha? "\u0628\uFCF2") #t)     ; Arabic letter beh with shaddah
(define (alpha? s)
 (regexp-match? #px"^\\p{L}+$" (string-normalize-nfc s)))
(test)

I suspect that there are some cases with scripts requiring multiple code points 
to render a single character such as Arabic with pronunciation marks e.g. دُ 
نْيَا. At the moment, I don’t have the time (or need) to investigate further.  
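
A minimal sketch of the other fix mentioned in the quoted explanation below (expanding the regexp to include modifiers), in case it is useful later. The name alpha-with-marks? is hypothetical, and treating every character in the general category M (marks) as an acceptable follower of a letter is an assumption that may be too permissive for some applications:

```racket
#lang racket

;; Hypothetical variant of alpha?: every letter (category L) may be
;; followed by any number of combining marks (category M), so no
;; normalization step is needed.
(define (alpha-with-marks? s)
  (regexp-match? #px"^(?:\\p{L}\\p{M}*)+$" s))

;; e followed by a combining diaeresis, left decomposed
(alpha-with-marks? "noe\u0308l")                                    ; #t
;; the Arabic word with pronunciation marks, written with escapes
(alpha-with-marks? "\u062F\u064F\u0646\u0652\u064A\u064E\u0627")    ; #t
(alpha-with-marks? "h1llo")                                         ; #f
;; a mark with no preceding letter is rejected
(alpha-with-marks? "\u0308")                                        ; #f
```

Because a mark is only accepted directly after a letter, a string that starts with a combining mark still fails, as does the empty string.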

The depth of Racket’s Unicode support is impressive.

Once again, thanks.

Peter


> On 10 Jul 2020, at 15:47, Ryan Culpepper  wrote:
> 
> (I see this went off the mailing list. If you reply, please consider CCing 
> the list.)
> 
> Yes, I understood your goal of trying to capture the notion of Unicode 
> "alphabetic" characters with a regular expression.
> 
> As far as I know, Unicode doesn't have a notion of "alphabetic", but it does 
> assign every code point to a "General category", consisting of a main 
> category and a subcategory. There is a category called "Letter", which seems 
> like one reasonable generalization of "alphabetic".
> 
> In Racket, you can get the code for a character's category using 
> `char-general-category`. For example:
> 
>   > (char-general-category #\A)
>   'lu
>   > (char-general-category #\é)
>   'll
>   > (char-general-category #\ß)
>   'll
>   > (char-general-category #\7)
>   'nd
> 
> The general category for "A" is "Letter, uppercase", which has the code "Lu", 
> which Racket turns into the symbol 'lu. The general category of "é" is 
> "Letter, lowercase", code "Ll", which becomes 'll. The general category of 
> "7" is "Number, decimal digit", code "Nd".
> 
> In Racket regular expressions, the \p{category} syntax lets you recognize 
> characters from a specific category. For example, \p{Lu} recognizes 
> characters with the category "Letter, uppercase", and \p{L} recognizes 
> characters with the category "Letter", which is the union of "Letter, 
> uppercase", "Letter, lowercase", and so on.
> 
> So the regular expression #px"^\\p{L}+$" recognizes sequences of one or more 
> Unicode letters. For example:
> 
>   > (regexp-match? #px"^\\p{L}+$" "héllo")
>   #t
>   > (regexp-match? #px"^\\p{L}+$" "straße")
>   #t
>   > (regexp-match? #px"^\\p{L}+$" "二の句")
>   #t
>   > (regexp-match? #px"^\\p{L}+$" "abc123")
>   #f ;; No, contains numbers
> 
> There are still some problems to watch out for, though. For example, accented 
> characters like "é" can be expressed as a single pre-composed code point or 
> "decomposed" into a base letter and a combining mark. You can get the 
> decomposed form by converting the string to "decomposed normal form" (NFD), 
> and the regexp above won't match that string.
> 
>   > (map char-general-category (string->list (string-normalize-nfd "é")))
>   '(ll mn)
>   > (regexp-match? #px"^\\p{L}+$" (string-normalize-nfd "héllo"))
>   #f
> 
> One fix would be to call `string-normalize-nfc` first, but some 
> letter-modifier pairs don't have pre-composed versions. Another fix would be 
> to expand the regexp to include modifiers. You'd have to decide which is 
> better based on your application.
> 
> Ryan
> 

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/racket-users/09B244A4-89C5-4B5C-97E7-5487059125F6%40gmail.com.


Re: [racket-users] Are Regular Expression classes Unicode aware?

2020-07-10 Thread Peter W A Wood
Dear Ryan

Thank you very much for the kind, detailed explanation which I will study 
carefully. It was not my intention to reply to you off-list. I hope I have 
correctly addressed this reply to appear on-list.

Peter

> On 10 Jul 2020, at 15:47, Ryan Culpepper  wrote:
> 
> (I see this went off the mailing list. If you reply, please consider CCing 
> the list.)
> 
> Yes, I understood your goal of trying to capture the notion of Unicode 
> "alphabetic" characters with a regular expression.
> 
> As far as I know, Unicode doesn't have a notion of "alphabetic", but it does 
> assign every code point to a "General category", consisting of a main 
> category and a subcategory. There is a category called "Letter", which seems 
> like one reasonable generalization of "alphabetic".
> 
> In Racket, you can get the code for a character's category using 
> `char-general-category`. For example:
> 
>   > (char-general-category #\A)
>   'lu
>   > (char-general-category #\é)
>   'll
>   > (char-general-category #\ß)
>   'll
>   > (char-general-category #\7)
>   'nd
> 
> The general category for "A" is "Letter, uppercase", which has the code "Lu", 
> which Racket turns into the symbol 'lu. The general category of "é" is 
> "Letter, lowercase", code "Ll", which becomes 'll. The general category of 
> "7" is "Number, decimal digit", code "Nd".
> 
> In Racket regular expressions, the \p{category} syntax lets you recognize 
> characters from a specific category. For example, \p{Lu} recognizes 
> characters with the category "Letter, uppercase", and \p{L} recognizes 
> characters with the category "Letter", which is the union of "Letter, 
> uppercase", "Letter, lowercase", and so on.
> 
> So the regular expression #px"^\\p{L}+$" recognizes sequences of one or more 
> Unicode letters. For example:
> 
>   > (regexp-match? #px"^\\p{L}+$" "héllo")
>   #t
>   > (regexp-match? #px"^\\p{L}+$" "straße")
>   #t
>   > (regexp-match? #px"^\\p{L}+$" "二の句")
>   #t
>   > (regexp-match? #px"^\\p{L}+$" "abc123")
>   #f ;; No, contains numbers
> 
> There are still some problems to watch out for, though. For example, accented 
> characters like "é" can be expressed as a single pre-composed code point or 
> "decomposed" into a base letter and a combining mark. You can get the 
> decomposed form by converting the string to "decomposed normal form" (NFD), 
> and the regexp above won't match that string.
> 
>   > (map char-general-category (string->list (string-normalize-nfd "é")))
>   '(ll mn)
>   > (regexp-match? #px"^\\p{L}+$" (string-normalize-nfd "héllo"))
>   #f
> 
> One fix would be to call `string-normalize-nfc` first, but some 
> letter-modifier pairs don't have pre-composed versions. Another fix would be 
> to expand the regexp to include modifiers. You'd have to decide which is 
> better based on your application.
> 
> Ryan
> 
> 
> 
> On Fri, Jul 10, 2020 at 2:10 AM Peter W A Wood  wrote:
> Ryan
> 
> > On 9 Jul 2020, at 22:52, Ryan Culpepper  wrote:
> > 
> > If you want a regular expression that does match the example string, you 
> > can use the \p{property} notation. For example:
> > 
> >   > (regexp-match? #px"^\\p{L}+$" "h\uFFC3\uFFA9llo")
> >   #t
> > 
> > The "Regexp Syntax" docs have a grammar for regular expressions with links 
> > to examples.
> > 
> > Ryan
> 
> Thanks. I used héllo as an example. I was wondering if there was a way of 
> specifying a regular expression group for Unicode “alphabetic” characters. 
> 
> On reflection, it seems a somewhat esoteric requirement that is almost 
> impossible to satisfy. By way of example, would 
> “Straße" be considered alphabetic? Would “二の句” be considered alphabetic?
> 
> Strangely, Python considered the Japanese characters as being alphabetic but 
> will not accept “Straße” as a valid string. (I suspect this is due to some 
> problem relating to the locale.)
> 
>  >>> "二の句".isalpha()
> True
> >>> “Straße".isalpha()
>   File "", line 1
> “Straße".isalpha()
>   ^
> SyntaxError: invalid character in identifier
> 
> Clearly, trying to identify “Unicode” alphabetic characters is far from 
> straightforward, though it may well be useful in processing some language 
> texts.
> 
> Peter
> 
> 

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/racket-users/BC855B5D-80BF-458B-A2D2-9570B0436646%40gmail.com.


Re: [racket-users] Are Regular Expression classes Unicode aware?

2020-07-09 Thread Sorawee Porncharoenwase
I did in fact try installing readline-gpl (raco pkg install readline-gpl),
but it didn’t change anything. Interestingly, the bug in #3223 persists for
me, too. This suggests that I didn’t install or invoke it correctly. Do you
need to run racket with any flag for readline-gpl to take effect?

But yes, the problem is definitely due to readline. Sam suggested that I
try racket -q, which suppresses readline, and with it there’s no issue.

On Thu, Jul 9, 2020 at 11:43 AM Philip McGrath 
wrote:

> On Thu, Jul 9, 2020 at 10:32 AM Sorawee Porncharoenwase <
> sorawee.pw...@gmail.com> wrote:
>
>> Racket REPL doesn’t handle unicode well. If you try (regexp-match?
>> #px"^[a-zA-Z]+$" "héllo") in DrRacket, or write it as a program in a
>> file and run it, you will find that it does evaluate to #f.
>>
> See this issue for workarounds, including installing the `readline-gpl`
> package: https://github.com/racket/racket/issues/3223
>
> But you may have some other issues: for me, `(regexp-match?
> #px"^[a-zA-Z]+$" "h\U+FFC3\U+FFA9llo")` gives an error saying "read-syntax:
> no hex digit following `\U`"
>
> For the original question:
>
>
>> On Thu, Jul 9, 2020 at 7:19 AM Peter W A Wood 
>> wrote:
>>
>>> I was experimenting with regular expressions to try to emulate the
>>> Python isalpha() String method.
>>>
>>
> You'd want to benchmark, but, for this purpose, I have a hunch you might
> get better performance by using `in-string` with a `for/and` loop (which
> can use unsafe operations internally)—probably especially so if you were
> content to just test `char-alphabetic?`, which follows Unicode's definition
> of "alphabetic" rather than Python's idiosyncratic one. Here's an example:
>
> #lang racket
>
> (module+ test
>   (require rackunit))
>
> (define (char-letter? ch)
>   ;; not the same as `char-alphabetic?`: see
>   ;; https://docs.python.org/3/library/stdtypes.html#str.isalpha
>   (case (char-general-category ch)
>     [(lm lt lu ll lo) #t]
>     [else #f]))
>
> (define (string-is-alpha? str)
>   (for/and ([ch (in-string str)])
>     (char-letter? ch)))
>
> (module+ test
>   (check-true (string-is-alpha? "hello"))
>   (check-false (string-is-alpha? "h1llo"))
>   (check-true (string-is-alpha? "héllo")))
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/racket-users/CADcuegsvwBVDhnjtR5Gu6itoYWQPiiQHdwcZnaMv1Qvne2dAVg%40mail.gmail.com.


Re: [racket-users] Are Regular Expression classes Unicode aware?

2020-07-09 Thread Philip McGrath
On Thu, Jul 9, 2020 at 10:32 AM Sorawee Porncharoenwase <
sorawee.pw...@gmail.com> wrote:

> Racket REPL doesn’t handle unicode well. If you try (regexp-match?
> #px"^[a-zA-Z]+$" "héllo") in DrRacket, or write it as a program in a file
> and run it, you will find that it does evaluate to #f.
>
See this issue for workarounds, including installing the `readline-gpl`
package: https://github.com/racket/racket/issues/3223

But you may have some other issues: for me, `(regexp-match?
#px"^[a-zA-Z]+$" "h\U+FFC3\U+FFA9llo")` gives an error saying "read-syntax:
no hex digit following `\U`"

For the original question:


> On Thu, Jul 9, 2020 at 7:19 AM Peter W A Wood 
> wrote:
>
>> I was experimenting with regular expressions to try to emulate the Python
>> isalpha() String method.
>>
>
You'd want to benchmark, but, for this purpose, I have a hunch you might
get better performance by using `in-string` with a `for/and` loop (which
can use unsafe operations internally)—probably especially so if you were
content to just test `char-alphabetic?`, which follows Unicode's definition
of "alphabetic" rather than Python's idiosyncratic one. Here's an example:

#lang racket

(module+ test
  (require rackunit))

(define (char-letter? ch)
  ;; not the same as `char-alphabetic?`: see
  ;; https://docs.python.org/3/library/stdtypes.html#str.isalpha
  (case (char-general-category ch)
    [(lm lt lu ll lo) #t]
    [else #f]))

(define (string-is-alpha? str)
  (for/and ([ch (in-string str)])
    (char-letter? ch)))

(module+ test
  (check-true (string-is-alpha? "hello"))
  (check-false (string-is-alpha? "h1llo"))
  (check-true (string-is-alpha? "héllo")))
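
For comparison, here is a rough sketch of the `char-alphabetic?` variant mentioned above. The name string-alphabetic? is made up for illustration. The non-empty check is an addition of mine: a `for/and` loop over an empty string yields #t, whereas Python documents `str.isalpha` as requiring at least one character.

```racket
#lang racket

;; Hypothetical variant: test Unicode's "Alphabetic" property with
;; char-alphabetic? and, like Python's str.isalpha, reject the empty string.
(define (string-alphabetic? str)
  (and (positive? (string-length str))
       (for/and ([ch (in-string str)])
         (char-alphabetic? ch))))

(string-alphabetic? "hello")  ; #t
(string-alphabetic? "h1llo")  ; #f
(string-alphabetic? "")       ; #f
```

Whether this or the general-category version is the better fit depends on which definition of "alphabetic" the application needs, and any performance difference would still need benchmarking.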

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/racket-users/CAH3z3gYfZqGe5hQheAsSxdX8VzAsGYhi61W5ZpmMkkaRb0F%2B5A%40mail.gmail.com.


Re: [racket-users] Are Regular Expression classes Unicode aware?

2020-07-09 Thread Ryan Culpepper
If you want a regular expression that does match the example string, you
can use the \p{property} notation. For example:

  > (regexp-match? #px"^\\p{L}+$" "h\uFFC3\uFFA9llo")
  #t

The "Regexp Syntax" docs have a grammar for regular expressions with links
to examples.
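
A few more uses of the same notation, as a small sketch. The variable names are only for illustration; the category codes Lu and L and the complement form \P{...} come from the "Regexp Syntax" grammar:

```racket
#lang racket

;; \p{Lu} matches only uppercase letters: does "École" start with one?
(define starts-upper? (regexp-match? #px"^\\p{Lu}" "\u00C9cole"))

;; \p{L}+ picks out maximal runs of letters: "abc" and "déf"
(define letter-runs (regexp-match* #px"\\p{L}+" "abc 123 d\u00E9f"))

;; \P{L} is the complement of \p{L}: strip everything that is not a letter
(define letters-only (regexp-replace* #px"\\P{L}+" "a1b2c" ""))

starts-upper?  ; #t
letter-runs    ; '("abc" "déf")
letters-only   ; "abc"
```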

Ryan


On Thu, Jul 9, 2020 at 4:32 PM Sorawee Porncharoenwase <
sorawee.pw...@gmail.com> wrote:

> Racket REPL doesn’t handle unicode well. If you try (regexp-match?
> #px"^[a-zA-Z]+$" "héllo") in DrRacket, or write it as a program in a file
> and run it, you will find that it does evaluate to #f.
>
> On Thu, Jul 9, 2020 at 7:19 AM Peter W A Wood 
> wrote:
>
>> I was experimenting with regular expressions to try to emulate the Python
>> isalpha() String method. Using a simple [a-zA-Z] character class worked for
>> the English alphabet (ASCII characters):
>>
>> > (regexp-match? #px"^[a-zA-Z]+$" "hello")
>> #t
>> > (regexp-match? #px"^[a-zA-Z]+$" "h1llo")
>> #f
>>
>> It then dawned on me that the Python isalpha() method was Unicode aware:
>>
>> >>> "é".isalpha()
>> True
>>
>> I started scratching my head as to how to achieve the equivalent using a
>> regular expression in Python. I tried the same regular expression with a
>> non-English character in the string. To my surprise, the regular expression
>> recognised the non-ASCII characters.
>>
>> > (regexp-match? #px"^[a-zA-Z]+$" "h\U+FFC3\U+FFA9llo")
>> #t
>>
>> Are Racket regular expression character classes Unicode aware or is there
>> some other explanation why this regular expression matches?
>>
>> Peter
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/racket-users/CANy33q%3DtBkQYDg-Tv1MEw17P1ipnqUDcDDFmq_%3DTumUAGJrHAA%40mail.gmail.com.


Re: [racket-users] Are Regular Expression classes Unicode aware?

2020-07-09 Thread Sorawee Porncharoenwase
Racket REPL doesn’t handle unicode well. If you try (regexp-match?
#px"^[a-zA-Z]+$" "héllo") in DrRacket, or write it as a program in a file
and run it, you will find that it does evaluate to #f.
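
One observation, offered as a guess rather than a diagnosis: the escapes U+FFC3 and U+FFA9 shown in the REPL end in C3 and A9, which are exactly the two bytes of é in UTF-8. That would be consistent with the terminal input being mangled byte by byte before the reader sees it. A small sketch to check the byte values (the name e-acute-bytes is only for illustration):

```racket
#lang racket

;; UTF-8 encoding of é (U+00E9), shown as hexadecimal strings.
(define e-acute-bytes
  (for/list ([b (in-bytes (string->bytes/utf-8 "\u00E9"))])
    (number->string b 16)))

e-acute-bytes ; '("c3" "a9")
```

This only shows where the odd code points may come from; it does not by itself explain why the match returned #t in the REPL.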

On Thu, Jul 9, 2020 at 7:19 AM Peter W A Wood  wrote:

> I was experimenting with regular expressions to try to emulate the Python
> isalpha() String method. Using a simple [a-zA-Z] character class worked for
> the English alphabet (ASCII characters):
>
> > (regexp-match? #px"^[a-zA-Z]+$" "hello")
> #t
> > (regexp-match? #px"^[a-zA-Z]+$" "h1llo")
> #f
>
> It then dawned on me that the Python isalpha() method was Unicode aware:
>
> >>> "é".isalpha()
> True
>
> I started scratching my head as to how to achieve the equivalent using a
> regular expression in Python. I tried the same regular expression with a
> non-English character in the string. To my surprise, the regular expression
> recognised the non-ASCII characters.
>
> > (regexp-match? #px"^[a-zA-Z]+$" "h\U+FFC3\U+FFA9llo")
> #t
>
> Are Racket regular expression character classes Unicode aware or is there
> some other explanation why this regular expression matches?
>
> Peter
>

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/racket-users/CADcuegsvf-hFwofptc2ieKQmqWFzxDnD1Cn8G7bFSzBZ%2BM3EDA%40mail.gmail.com.


[racket-users] Are Regular Expression classes Unicode aware?

2020-07-09 Thread Peter W A Wood
I was experimenting with regular expressions to try to emulate the Python 
isalpha() String method. Using a simple [a-zA-Z] character class worked for the 
English alphabet (ASCII characters):

> (regexp-match? #px"^[a-zA-Z]+$" "hello")
#t
> (regexp-match? #px"^[a-zA-Z]+$" "h1llo")
#f 

It then dawned on me that the Python isalpha() method was Unicode aware:

>>> "é".isalpha()
True

I started scratching my head as to how to achieve the equivalent using a regular 
expression in Python. I tried the same regular expression with a non-English 
character in the string. To my surprise, the regular expression recognised the 
non-ASCII characters.

> (regexp-match? #px"^[a-zA-Z]+$" "h\U+FFC3\U+FFA9llo")
#t

Are Racket regular expression character classes Unicode aware or is there some 
other explanation why this regular expression matches?

Peter

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/racket-users/2197C34F-165D-4D97-97AD-F158153316F5%40gmail.com.