Re: German Umlauts / UTF8 with comparse
Ah, right. Thanks! If I remember correctly, that was also discussed in the older mail thread about parsing Japanese, when Moritz said that he didn't want to make comparse users dependent on utf8.

Works well now, and thanks also for mentioning the ,d trick!

On Tue, Feb 18, 2020 at 12:44 PM wrote:

> Christoph Lange wrote:
> > Yes, this helps. Kind of ;-) ... using the character set
> > char-set:alphabetic, my umlauts are now parsed. But I don't get them back
> > in my result, at least not as printable characters. Instead, the following
> > happens, and utterly confuses me:
>
> Hmm, indeed. From what I can see, the result of parse is not encoded in
> UTF-8.
>
> I went to see comparse’s code and found that the (as-string) combiner
> uses (->string) internally. But since comparse doesn’t use the utf8 egg,
> it uses the core version of (->string), which happens to encode #\ä in
> latin-1!
>
> The only workaround I can think of right now is to move the conversion
> back to a string out of the comparse egg and into your own, utf8 aware,
> code.
>
> This would look something like this:
>
>     (import comparse utf8 utf8-srfi-14 unicode-char-sets)
>
>     (define s "Gänsesäger 2,1")
>     (define s1 "Rotkehlchen 1,0")
>
>     (define (utf8-in cs)
>       (satisfies (lambda (c) (char-set-contains? cs c))))
>
>     (define letter
>       (utf8-in char-set:alphabetic))
>
>     (define letters
>       (repeated letter 1 20))
>
>     (define (parse-as-string parser input)
>       (list->string (parse parser input)))
>
>     (define p1 (parse-as-string letters (string->list s1)))
>     (define p (parse-as-string letters (string->list s)))
>
> PS: a trick I used to check the encoding of the strings was using the ,d
> csi command, which prints the contents of the string byte by byte. There
> it’s easy to see if non ascii characters indeed take more than one byte
> as they should in UTF-8.

--
Christoph Lange
Lotsarnas Väg 8
430 83 Vrångö
Re: German Umlauts / UTF8 with comparse
Christoph Lange wrote:
> Yes, this helps. Kind of ;-) ... using the character set
> char-set:alphabetic, my umlauts are now parsed. But I don't get them back
> in my result, at least not as printable characters. Instead, the following
> happens, and utterly confuses me:

Hmm, indeed. From what I can see, the result of parse is not encoded in UTF-8.

I looked at comparse’s code and found that the (as-string) combiner uses (->string) internally. But since comparse doesn’t use the utf8 egg, it uses the core version of (->string), which happens to encode #\ä in Latin-1!

The only workaround I can think of right now is to move the conversion back to a string out of the comparse egg and into your own, utf8-aware, code.

This would look something like this:

    (import comparse utf8 utf8-srfi-14 unicode-char-sets)

    (define s "Gänsesäger 2,1")
    (define s1 "Rotkehlchen 1,0")

    ;; Parser that accepts one character contained in the char-set cs.
    (define (utf8-in cs)
      (satisfies (lambda (c) (char-set-contains? cs c))))

    (define letter
      (utf8-in char-set:alphabetic))

    ;; No (as-string) here, so the result is a list of characters.
    (define letters
      (repeated letter 1 20))

    ;; Build the string with the utf8 egg's list->string instead.
    (define (parse-as-string parser input)
      (list->string (parse parser input)))

    (define p1 (parse-as-string letters (string->list s1)))
    (define p (parse-as-string letters (string->list s)))

PS: a trick I used to check the encoding of the strings was the ,d csi command, which prints the contents of the string byte by byte. There it’s easy to see whether non-ASCII characters indeed take more than one byte, as they should in UTF-8.
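A scriptable complement to the ,d check in the PS is to count the bytes behind a string directly. This is only a sketch under stated assumptions: it assumes CHICKEN 5, where string->blob and blob-size come from the (chicken blob) module, and a source file saved as UTF-8. The name byte-length is made up for illustration.

```scheme
(import (chicken blob))

;; Hypothetical helper: number of bytes backing a string,
;; regardless of how many characters those bytes encode.
(define (byte-length str)
  (blob-size (string->blob str)))

(define ascii-only "Rotkehlchen")    ; 11 ASCII letters
(define with-umlauts "Gänsesäger")   ; 10 letters, two of them ä

(display (byte-length ascii-only))
(newline)
(display (byte-length with-umlauts))
(newline)
```

If the string really is UTF-8, each ä takes two bytes, so the second count should be larger than the number of letters. A count equal to the number of letters would point to a one-byte-per-character encoding such as Latin-1.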
Re: German Umlauts / UTF8 with comparse
Yes, this helps. Kind of ;-) ... using the character set char-set:alphabetic, my umlauts are now parsed. But I don't get them back in my result, at least not as printable characters. Instead, the following happens, and utterly confuses me:

    #;2> (define s3 (parse letters (string->list s)))
    #;3> s3
    "Gnsesger"
    #;4> (string-length s3)
    6
    #;5> (string->list s3)
    (#\G #\x4bb3 #\e #\s #\x49e5 #\r)
    #;6> (list->string (string->list s3))
    "G䮳es䧥r"

So, I put the parse result into 's3'. Printing it, I read an eight-character string, namely the one I want, minus my beloved umlauts. 'string-length' reports that string as six characters long, and 'string->list' gives me exactly that, swallowing still other ASCII characters of my string. Converting that back with 'list->string' then includes Chinese characters ... even though '(list->string (string->list s1))', with my pure ASCII string, round-trips without fault.

I guess I have some problems understanding some utf8 concepts?!

/Christoph

On Mon, Feb 17, 2020 at 3:38 PM wrote:

> Christoph Lange wrote:
> > meaning, that the ä isn't recognized as being a letter within the
> > 'char-set:letter'.
>
> The utf8 egg’s srfi-14 character sets are designed to be compatible with
> the original srfi-14 and only contain ASCII characters, as stated in the
> documentation:
> https://wiki.call-cc.org/eggref/5/utf8#unicode-char-sets
> “The default SRFI-14 char-sets are defined using ASCII-only characters”
>
> You might want to import the unicode-char-sets module, and use one of its
> sets, like char-set:alphabetic.
>
> I hope this helps. :)

--
Christoph Lange
Lotsarnas Väg 8
430 83 Vrångö
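The odd characters in this transcript can be reproduced by hand, assuming (as the reply to this message suggests) that each ä came back as the single Latin-1 byte #xE4. In UTF-8, a byte of that form announces a three-byte sequence, so a decoder that does not validate the following bytes would fold the next two ASCII letters into one code point. The procedure name below is hypothetical, and it models a lenient decoder in general, not the utf8 egg's actual code.

```scheme
;; Combine three bytes the way a lenient UTF-8 decoder would:
;; 4 payload bits from the lead byte, 6 from each following byte.
(define (lenient-decode-3 b0 b1 b2)
  (+ (* (modulo b0 16) 4096)
     (* (modulo b1 64) 64)
     (modulo b2 64)))

;; #xE4 followed by "ns" (#x6E #x73), as in "Gäns..."
(display (number->string (lenient-decode-3 #xE4 #x6E #x73) 16)) ; prints 4bb3
(newline)

;; #xE4 followed by "ge" (#x67 #x65), as in "...säge..."
(display (number->string (lenient-decode-3 #xE4 #x67 #x65) 16)) ; prints 49e5
(newline)
```

Read this way, the ten bytes G, #xE4, n, s, e, s, #xE4, g, e, r collapse into six characters: G, one merged character, e, s, another merged character, r. That matches both the reported length and the list printed by 'string->list'.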
Re: German Umlauts / UTF8 with comparse
Christoph Lange wrote:
> meaning, that the ä isn't recognized as being a letter within the
> 'char-set:letter'.

The utf8 egg’s srfi-14 character sets are designed to be compatible with the original srfi-14 and only contain ASCII characters, as stated in the documentation:

https://wiki.call-cc.org/eggref/5/utf8#unicode-char-sets

“The default SRFI-14 char-sets are defined using ASCII-only characters”

You might want to import the unicode-char-sets module, and use one of its sets, like char-set:alphabetic.

I hope this helps. :)
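The difference between the two kinds of character set can be checked directly at the REPL. A minimal sketch, assuming the utf8 egg is installed and provides the utf8-srfi-14 and unicode-char-sets modules named in this thread; the variable name a-umlaut is made up for illustration.

```scheme
(import utf8 utf8-srfi-14 unicode-char-sets)

;; ä is U+00E4; it is built from its code point so the example does
;; not depend on the encoding of this file.
(define a-umlaut (integer->char #xE4))

;; ASCII-only SRFI-14 set versus the Unicode-aware set.
(display (char-set-contains? char-set:letter a-umlaut))
(newline)
(display (char-set-contains? char-set:alphabetic a-umlaut))
(newline)
```

Going by the documentation quoted above, the first check should report that ä is not a member and the second that it is.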
Re: German Umlauts / UTF8 with comparse
Hi Christoph,

On 17 February 2020 14:31 +01, Christoph Lange wrote:

> meaning, that the ä isn't recognized as being a letter within the
> 'char-set:letter'. (The UTF8 aspect of correct character width works on the
> other hand: in the remaining string, the ä is represented by only one #\.
> If I don't use the UTF8 string equivalents by importing 'utf8', it would be
> two.)

This is because char-set:letter is not redefined by `utf8-srfi-14`. You can import `unicode-char-sets`, which should give you what you need. See http://wiki.call-cc.org/eggref/5/utf8#unicode-char-sets for a list of char-sets supported by that module.

Hope that helps!

Moritz