Re: [racket-users] Are Regular Expression classes Unicode aware?

Ryan Culpepper Sat, 11 Jul 2020 04:42:14 -0700

Great, I'm glad it was useful!

Ryan



On Sat, Jul 11, 2020 at 12:27 PM Peter W A Wood <peterwaw...@gmail.com>
wrote:

> Dear Ryan
>
> Thank you for both your complete, understandable explanation and a
> working solution, which is more than sufficient for my needs.
>
> I created a very simple function based on the regexp that you suggested
> and tested it against a number of cases:
>
>
> #lang racket
> (require test-engine/racket-tests)
>
> (check-expect (alpha? "") #f)  ; empty string
> (check-expect (alpha? "1") #f)
> (check-expect (alpha? "a") #t)
> (check-expect (alpha? "hello") #t)
> (check-expect (alpha? "h1llo") #f)
> (check-expect (alpha? "\u00E7c\u0327") #t)  ; çç
> (check-expect (alpha? "noe\u0308l") #t)  ; noël
> (check-expect (alpha? "\U01D122") #f)  ; 𝄢 (bass clef)
> (check-expect (alpha? "\u216B") #f)  ; Ⅻ (roman numeral)
> (check-expect (alpha? "\u0BEB") #f)  ; ௫ (5 in Tamil)
> (check-expect (alpha? "二の句") #t)  ; Japanese word "ninoku"
> (check-expect (alpha? "مدينة") #t)  ; Arabic word "madina"
> (check-expect (alpha? "٥") #f)  ; Arabic number 5
> (check-expect (alpha? "\u0628\uFCF2") #t)  ; Arabic letter beh with shaddah
> (define (alpha? s)
>  (regexp-match? #px"^\\p{L}+$" (string-normalize-nfc s)))
> (test)
>
> I suspect that some cases will still fail in scripts that require
> multiple code points to render a single character, such as Arabic with
> pronunciation marks, e.g. دُ نْيَا. At the moment, I don’t have the time
> (or need) to investigate further.
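
The suspicion about Arabic pronunciation marks can be checked with a small sketch. It assumes the `alpha?` definition from the message above; the helper `categories` and the name `dal+damma` are hypothetical and only serve the illustration. The vowel mark is a separate code point with a "Mark, nonspacing" category, and NFC has no pre-composed form to fold it into, so the letters-only regexp should reject the string.

```racket
#lang racket

;; Hypothetical helper (not from the thread): the general category
;; of each code point in a string.
(define (categories s)
  (map char-general-category (string->list s)))

;; alpha? as defined in the message above.
(define (alpha? s)
  (regexp-match? #px"^\\p{L}+$" (string-normalize-nfc s)))

;; Dal (U+062F) followed by the damma vowel mark (U+064F).
(define dal+damma "\u062F\u064F")

(categories dal+damma)                            ; '(lo mn)
(string-length (string-normalize-nfc dal+damma))  ; 2, NFC keeps both code points
(alpha? dal+damma)                                ; #f
```

Whether such strings should count as alphabetic depends on the application; the sketch only shows why the current test rejects them.
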
>
> The depth of Racket’s Unicode support is impressive.
>
> Once again, thanks.
>
> Peter
>
>
> > On 10 Jul 2020, at 15:47, Ryan Culpepper <rmculpepp...@gmail.com> wrote:
> >
> > (I see this went off the mailing list. If you reply, please consider
> > CCing the list.)
> >
> > Yes, I understood your goal of trying to capture the notion of Unicode
> > "alphabetic" characters with a regular expression.
> >
> > As far as I know, Unicode doesn't have a notion of "alphabetic", but it
> > does assign every code point to a "General category", consisting of a
> > main category and a subcategory. There is a category called "Letter",
> > which seems like one reasonable generalization of "alphabetic".
> >
> > In Racket, you can get the code for a character's category using
> > `char-general-category`. For example:
> >
> >   > (char-general-category #\A)
> >   'lu
> >   > (char-general-category #\é)
> >   'll
> >   > (char-general-category #\ß)
> >   'll
> >   > (char-general-category #\7)
> >   'nd
> >
> > The general category for "A" is "Letter, uppercase", which has the code
> > "Lu", which Racket turns into the symbol 'lu. The general category of
> > "é" is "Letter, lowercase", code "Ll", which becomes 'll. The general
> > category of "7" is "Number, decimal digit", code "Nd".
> >
> > In Racket regular expressions, the \p{category} syntax lets you
> > recognize characters from a specific category. For example, \p{Lu}
> > recognizes characters with the category "Letter, uppercase", and \p{L}
> > recognizes characters with the category "Letter", which is the union of
> > "Letter, uppercase", "Letter, lowercase", and so on.
> >
> > So the regular expression #px"^\\p{L}+$" recognizes sequences of one or
> > more Unicode letters. For example:
> >
> >   > (regexp-match? #px"^\\p{L}+$" "héllo")
> >   #t
> >   > (regexp-match? #px"^\\p{L}+$" "straße")
> >   #t
> >   > (regexp-match? #px"^\\p{L}+$" "二の句")
> >   #t
> >   > (regexp-match? #px"^\\p{L}+$" "abc123")
> >   #f ;; No, contains numbers
> >
> > There are still some problems to watch out for, though. For example,
> > accented characters like "é" can be expressed as a single pre-composed
> > code point or "decomposed" into a base letter and a combining mark. You
> > can get the decomposed form by converting the string to Normalization
> > Form D (NFD, canonical decomposition), and the regexp above won't match
> > that string, because the combining mark is not in the "Letter" category.
> >
> >   > (map char-general-category (string->list (string-normalize-nfd "é")))
> >   '(ll mn)
> >   > (regexp-match? #px"^\\p{L}+$" (string-normalize-nfd "héllo"))
> >   #f
> > 
> > One fix would be to call `string-normalize-nfc` first, but some
> > letter-modifier pairs don't have pre-composed versions. Another fix
> > would be to expand the regexp to include modifiers. You'd have to decide
> > which is better based on your application.
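
A minimal sketch of the two fixes side by side may help with that decision. The names `alpha-nfc?` and `alpha-marks?` are hypothetical, and the second pattern is only one possible reading of "expand the regexp to include modifiers": it requires a leading letter and then accepts any mix of letters and combining marks (category "M").

```racket
#lang racket

;; Fix 1: normalize to NFC first, then require letters only.
(define (alpha-nfc? s)
  (regexp-match? #px"^\\p{L}+$" (string-normalize-nfc s)))

;; Fix 2 (one possible expansion, an assumption rather than the
;; thread's wording): a letter, then letters or combining marks.
(define (alpha-marks? s)
  (regexp-match? #px"^\\p{L}(?:\\p{L}|\\p{M})*$" s))

;; "héllo" in decomposed form: e followed by U+0301.
(define decomposed (string-normalize-nfd "h\u00E9llo"))

(alpha-nfc? decomposed)    ; #t, NFC recomposes é
(alpha-marks? decomposed)  ; #t
(alpha-nfc? "q\u0303")     ; #f, q + combining tilde has no pre-composed form
(alpha-marks? "q\u0303")   ; #t
(alpha-marks? "abc123")    ; #f
```

The second version is more permissive: it accepts a mark after any letter, whether or not the combination is meaningful in a given script, which may or may not suit the application.
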
> >
> > Ryan
> >
>
> --
> You received this message because you are subscribed to the Google Groups
> "Racket Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to racket-users+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/racket-users/09B244A4-89C5-4B5C-97E7-5487059125F6%40gmail.com
> .
>

