Re: German Umlauts / UTF8 with comparse

2020-02-18 Thread Christoph Lange
Ah, right. Thanks! If I remember correctly, that was also discussed in
the older mail thread about parsing Japanese, when Moritz said that he
didn't want to make comparse users depend on the utf8 egg.

Works well now, and also thanks for mentioning the ,d trick!

On Tue, Feb 18, 2020 at 12:44 PM  wrote:

> Christoph Lange  wrote:
> > Yes, this helps. Kind of ;-) ... using the character set
> > char-set:alphabetic, my umlauts are now parsed. But I don't get them back
> > in my result, at least not as printable characters. Instead, the following
> > happens, and utterly confuses me:
>
> Hmm, indeed. From what I can see, the result of parse is not encoded in
> UTF-8.
>
> I went to see comparse’s code and found that the (as-string) combiner
> uses (->string) internally. But since comparse doesn’t use the utf8 egg,
> it uses the core version of (->string), which happens to encode #\ä in
> latin-1!
>
> The only workaround I can think of right now is to move the conversion
> back to a string out of the comparse egg and into your own, utf8 aware,
> code.
>
> This would look something like this:
>
>
> (import comparse utf8 utf8-srfi-14 unicode-char-sets)
>
> (define s "Gänsesäger 2,1")
> (define s1 "Rotkehlchen 1,0")
>
> (define (utf8-in cs)
>   (satisfies (lambda (c) (char-set-contains? cs c))))
>
> (define letter
>   (utf8-in char-set:alphabetic))
>
> (define letters
>   (repeated letter 1 20))
>
> (define (parse-as-string parser input)
>   (list->string (parse parser input)))
>
> (define p1 (parse-as-string letters (string->list s1)))
> (define p (parse-as-string letters (string->list s)))
>
>
> PS: a trick I used to check the encoding of the strings was using the ,d
> csi command, which prints the contents of the string byte by byte. There
> it’s easy to see if non ascii characters indeed take more than one byte
> as they should in UTF-8.
>


-- 
Christoph Lange
Lotsarnas Väg 8
430 83 Vrångö


Re: German Umlauts / UTF8 with comparse

2020-02-18 Thread kooda
Christoph Lange  wrote:
> Yes, this helps. Kind of ;-) ... using the character set
> char-set:alphabetic, my umlauts are now parsed. But I don't get them back
> in my result, at least not as printable characters. Instead, the following
> happens, and utterly confuses me:

Hmm, indeed. From what I can see, the result of parse is not encoded in
UTF-8.

I went to see comparse’s code and found that the (as-string) combiner
uses (->string) internally. But since comparse doesn’t use the utf8 egg,
it uses the core version of (->string), which happens to encode #\ä in
latin-1!
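
To make the symptom concrete: in Latin-1, ä is the single byte #xE4, and
#xE4 also happens to be the lead byte of a three-byte UTF-8 sequence. A
UTF-8 decoder that doesn't validate continuation bytes will therefore read
#xE4 together with the next two ASCII bytes as one code point. The sketch
below only illustrates that arithmetic. It assumes CHICKEN 5 core modules,
and lenient-decode-3 is a made-up helper, not something taken from comparse
or the utf8 egg.

```scheme
(import scheme (chicken base) (chicken bitwise))

;; Combine three bytes as if they formed one 3-byte UTF-8 sequence,
;; without checking that b1 and b2 really are continuation bytes.
;; (Hypothetical helper, for illustration only.)
(define (lenient-decode-3 b0 b1 b2)
  (bitwise-ior (arithmetic-shift (bitwise-and b0 #x0F) 12)
               (arithmetic-shift (bitwise-and b1 #x3F) 6)
               (bitwise-and b2 #x3F)))

;; Latin-1 ä (#xE4) followed by "ns", then by "ge", as in "Gänsesäger".
(print (number->string
        (lenient-decode-3 #xE4 (char->integer #\n) (char->integer #\s))
        16)) ; prints 4bb3
(print (number->string
        (lenient-decode-3 #xE4 (char->integer #\g) (char->integer #\e))
        16)) ; prints 49e5
```

These are the two odd characters in the string->list output reported
earlier in the thread, which would also explain why the result appears to
be only six characters long: each ä takes the two letters after it along.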

The only workaround I can think of right now is to move the conversion
back to a string out of the comparse egg and into your own, utf8 aware,
code.

This would look something like this:


(import comparse utf8 utf8-srfi-14 unicode-char-sets)

(define s "Gänsesäger 2,1")
(define s1 "Rotkehlchen 1,0")

;; Parser that accepts a single character belonging to the char-set cs.
(define (utf8-in cs)
  (satisfies (lambda (c) (char-set-contains? cs c))))

(define letter
  (utf8-in char-set:alphabetic))

(define letters
  (repeated letter 1 20))

;; Build the result string here, with the utf8 egg's list->string,
;; instead of using comparse's as-string.
(define (parse-as-string parser input)
  (list->string (parse parser input)))

(define p1 (parse-as-string letters (string->list s1)))
(define p (parse-as-string letters (string->list s)))


PS: a trick I used to check the encoding of the strings was using the ,d
csi command, which prints the contents of the string byte by byte. There
it’s easy to see if non ascii characters indeed take more than one byte
as they should in UTF-8.
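
The same check can also be done in code instead of with ,d. This is a
minimal sketch, assuming CHICKEN 5 and only the core modules (chicken blob)
and srfi-4; string-bytes is a hypothetical helper name. The utf8 egg is
deliberately not imported here, so that string and list->string store one
byte per character, which makes it possible to build both encodings of ä by
hand.

```scheme
(import scheme (chicken base) (chicken blob) srfi-4)

;; Return the raw bytes of a string as a list of integers.
;; (Hypothetical helper, for illustration only.)
(define (string-bytes s)
  (u8vector->list (blob->u8vector (string->blob s))))

;; Build both encodings of ä from explicit byte values, so the result
;; does not depend on how this file itself is encoded.
(define a-utf8   (list->string (map integer->char '(#xC3 #xA4))))
(define a-latin1 (string (integer->char #xE4)))

(print (string-bytes a-utf8))   ; prints (195 164): two bytes, UTF-8
(print (string-bytes a-latin1)) ; prints (228): one byte, Latin-1
```

Applied to a parse result, a lone 228 where 195 164 was expected would point
to the Latin-1 problem described above.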


Re: German Umlauts / UTF8 with comparse

2020-02-17 Thread Christoph Lange
Yes, this helps. Kind of ;-) ... using the character set
char-set:alphabetic, my umlauts are now parsed. But I don't get them back
in my result, at least not as printable characters. Instead, the following
happens, and utterly confuses me:

#;2> (define s3 (parse letters (string->list s)))
#;3> s3
"Gnsesger"
#;4> (string-length s3)
6
#;5> (string->list s3)
(#\G #\x4bb3 #\e #\s #\x49e5 #\r)
#;6> (list->string (string->list s3))
"G䮳es䧥r"


So, I put the parse result into 's3'. Printing it shows an eight-character
string, namely the one I want, minus my beloved umlauts. 'string-length'
reports that string to be six characters long, and 'string->list' gives me
exactly six, swallowing other ASCII characters of my string as well.
Turning that list back into a string with 'list->string' produces Chinese
characters ... even though '(list->string (string->list s1))', with my pure
ASCII string, round-trips without fault.

I guess I have some problems understanding some utf8 concepts?!

/Christoph

On Mon, Feb 17, 2020 at 3:38 PM  wrote:

> Christoph Lange  wrote:
> > meaning, that the ä isn't recognized as being a letter within the
> > 'char-set:letter'.
>
> The utf8 egg’s srfi-14 character sets are designed to be compatible with
> the original srfi-14 and only contain ASCII characters, as stated in the
> documentation:
> https://wiki.call-cc.org/eggref/5/utf8#unicode-char-sets
> “The default SRFI-14 char-sets are defined using ASCII-only characters”
>
> You might want to import the unicode-char-sets module, and use one of its
> sets, like char-set:alphabetic.
>
> I hope this helps. :)
>


-- 
Christoph Lange
Lotsarnas Väg 8
430 83 Vrångö


Re: German Umlauts / UTF8 with comparse

2020-02-17 Thread kooda
Christoph Lange  wrote:
> meaning, that the ä isn't recognized as being a letter within the
> 'char-set:letter'.

The utf8 egg’s srfi-14 character sets are designed to be compatible with the 
original srfi-14 and only contain ASCII characters, as stated in the 
documentation:
https://wiki.call-cc.org/eggref/5/utf8#unicode-char-sets
“The default SRFI-14 char-sets are defined using ASCII-only characters”

You might want to import the unicode-char-sets module, and use one of its
sets, like char-set:alphabetic.
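
A quick way to compare the two sets at the REPL is sketched below. It
assumes CHICKEN 5 with the utf8 egg installed; a-umlaut is just an
illustrative name. Going by the documentation quoted above, the first check
should come out false and the second true, but I haven't included the
output here.

```scheme
(import utf8 utf8-srfi-14 unicode-char-sets)

;; Build ä from its code point so the check does not depend on the
;; encoding of this file.
(define a-umlaut (integer->char #xE4))

;; Default SRFI-14 set from the utf8 egg (documented as ASCII-only).
(print (char-set-contains? char-set:letter a-umlaut))

;; Unicode-aware set from unicode-char-sets.
(print (char-set-contains? char-set:alphabetic a-umlaut))
```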

I hope this helps. :)


Re: German Umlauts / UTF8 with comparse

2020-02-17 Thread Moritz Heidkamp
Hi Christoph,

On 17 February 2020 14:31 +01, Christoph Lange wrote:

> meaning, that the ä isn't recognized as being a letter within the
> 'char-set:letter'. (The UTF8 aspect of correct character width works on the
> other hand: in the remaining string, the ä is represented by only one #\.
> If I don't use the UTF8 string equivalents by importing 'utf8', it would be
> two.)

this is because char-set:letter is not redefined by `utf8-srfi-13`. You
can import `unicode-char-sets` which should give you what you need. See
http://wiki.call-cc.org/eggref/5/utf8#unicode-char-sets for a list of
char-sets supported by that module.

Hope that helps!

Moritz