Re: German Umlauts / UTF8 with comparse
Yes, this helps. Kind of ;-) ... using the character set char-set:alphabetic, my umlauts are now parsed. But I don't get them back in my result, at least not as printable characters. Instead, the following happens, and utterly confuses me:

#;2> (define s3 (parse letters (string->list s)))
#;3> s3
"Gnsesger"
#;4> (string-length s3)
6
#;5> (string->list s3)
(#\G #\x4bb3 #\e #\s #\x49e5 #\r)
#;6> (list->string (string->list s3))
"G䮳es䧥r"

So, I put the parse result into 's3'. Printing it, I read an eight-character string, namely the one I want, minus my beloved umlauts. 'string-length' reports that string as six characters long, and 'string->list' gives me exactly six characters, swallowing still other ASCII characters of my string. Reversing that with 'list->string' yields a string that includes Chinese characters, even though '(list->string (string->list s1))', with my pure ASCII string, round-trips without fault.

I guess I have some problems understanding some utf8 concepts?!

/Christoph

On Mon, Feb 17, 2020 at 3:38 PM wrote:
> Christoph Lange wrote:
> > meaning, that the ä isn't recognized as being a letter within the
> > 'char-set:letter'.
>
> The utf8 egg’s srfi-14 character sets are designed to be compatible with
> the original srfi-14 and only contain ASCII characters, as stated in the
> documentation:
> https://wiki.call-cc.org/eggref/5/utf8#unicode-char-sets
> “The default SRFI-14 char-sets are defined using ASCII-only characters”
>
> You might want to import the unicode-char-sets module, and use one of its
> sets, like char-set:alphabetic.
>
> I hope this helps. :)
>

--
Christoph Lange
Lotsarnas Väg 8
430 83 Vrångö
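A possible reading of the transcript above, offered as a guess rather than a confirmed diagnosis: if the parse result holds each ä as the single raw byte #xE4 instead of its two-byte UTF-8 encoding, a decoder that takes #xE4 as the lead byte of a three-byte sequence and does not validate the two bytes that follow would absorb the next two ASCII characters. The portable arithmetic below reproduces exactly the two code points shown by string->list. The helper name lenient-decode-3 is made up for illustration and is not part of the utf8 egg.

```scheme
;; Hypothetical helper (not from the utf8 egg): combine three bytes the way
;; a three-byte UTF-8 sequence is decoded, WITHOUT checking that the second
;; and third bytes are real continuation bytes (bit pattern 10xxxxxx).
(define (lenient-decode-3 b0 b1 b2)
  (+ (* (modulo b0 16) 4096)   ; low 4 bits of the lead byte, shifted left 12
     (* (modulo b1 64) 64)     ; low 6 bits of the second byte, shifted left 6
     (modulo b2 64)))          ; low 6 bits of the third byte

;; Raw byte #xE4 followed by the ASCII letters "n" and "s" (from "Gäns...")
(define fused-1
  (lenient-decode-3 #xE4 (char->integer #\n) (char->integer #\s)))

;; Raw byte #xE4 followed by the ASCII letters "g" and "e" (from "...äger")
(define fused-2
  (lenient-decode-3 #xE4 (char->integer #\g) (char->integer #\e)))

(display (number->string fused-1 16)) (newline) ; prints 4bb3
(display (number->string fused-2 16)) (newline) ; prints 49e5
```

Under that assumption the ten bytes G, #xE4, n, s, e, s, #xE4, g, e, r would be read as six characters (G, one fused character, e, s, one fused character, r), which would also match the string-length result of 6.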
Re: German Umlauts / UTF8 with comparse
Christoph Lange wrote:
> meaning, that the ä isn't recognized as being a letter within the
> 'char-set:letter'.

The utf8 egg’s srfi-14 character sets are designed to be compatible with the original srfi-14 and only contain ASCII characters, as stated in the documentation:
https://wiki.call-cc.org/eggref/5/utf8#unicode-char-sets
“The default SRFI-14 char-sets are defined using ASCII-only characters”

You might want to import the unicode-char-sets module, and use one of its sets, like char-set:alphabetic.

I hope this helps. :)
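A minimal sketch of that suggestion, assuming the comparse and utf8 eggs are installed for CHICKEN 5 and reusing the definitions from the original question unchanged apart from the extra import and the char-set name. It only shows the membership test and the parser definitions; how the string produced by as-string prints is not addressed here.

```scheme
;; Assumes CHICKEN 5 with the comparse and utf8 eggs installed.
(import comparse utf8 utf8-srfi-14 unicode-char-sets)

;; Same helper as in the original question: a parser that accepts one
;; character if it is a member of the given char-set.
(define (utf8-in cs)
  (satisfies (lambda (c) (char-set-contains? cs c))))

;; char-set:alphabetic comes from unicode-char-sets and is not limited
;; to ASCII, unlike the default char-set:letter.
(define letter (utf8-in char-set:alphabetic))
(define letters (as-string (repeated letter 1 20)))

;; Membership check for ä (U+00E4), written via integer->char so the
;; example does not depend on how the reader handles a non-ASCII literal.
(define umlaut-is-alphabetic?
  (char-set-contains? char-set:alphabetic (integer->char #xE4)))
```

With these definitions, (parse letters (string->list "Gänsesäger 2,1")) is the call to try at the REPL.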
Re: German Umlauts / UTF8 with comparse
Hi Christoph,

On 17 February 2020 14:31 +01, Christoph Lange wrote:
> meaning, that the ä isn't recognized as being a letter within the
> 'char-set:letter'. (The UTF8 aspect of correct character width works on the
> other hand: in the remaining string, the ä is represented by only one #\.
> If I don't use the UTF8 string equivalents by importing 'utf8', it would be
> two.)

this is because char-set:letter is not redefined by `utf8-srfi-13`. You can import `unicode-char-sets` which should give you what you need. See http://wiki.call-cc.org/eggref/5/utf8#unicode-char-sets for a list of char-sets supported by that module.

Hope that helps!
Moritz
German Umlauts / UTF8 with comparse
I read older threads about parsing Japanese with comparse and took some ideas from there, but am still stuck:

(import comparse utf8 utf8-srfi-14)

(define s "Gänsesäger 2,1")
(define s1 "Rotkehlchen 1,0")

(define (utf8-in cs)
  (satisfies (lambda (c) (char-set-contains? cs c))))

(define letter (utf8-in char-set:letter))
(define letters (as-string (repeated letter 1 20)))

This is what I have, and the 'word' at the beginning of s1 is parsed completely and correctly with the 'letters' parser:

#;1> (parse letters (string->list s1))
"Rotkehlchen"
# ; 2 values

For 's' though I get this:

#;2> (parse letters (string->list s))
"G"
# ; 2 values

meaning, that the ä isn't recognized as being a letter within the 'char-set:letter'. (The UTF8 aspect of correct character width works on the other hand: in the remaining string, the ä is represented by only one #\. If I don't use the UTF8 string equivalents by importing 'utf8', it would be two.)

Any hint for me?

/Christoph

--
Christoph Lange
Lotsarnas Väg 8
430 83 Vrångö