date:20200217

Re: German Umlauts / UTF8 with comparse

2020-02-17 Thread Christoph Lange

Yes, this helps. Kind of ;-) ... using the character set
char-set:alphabetic, my umlauts are now parsed. But I don't get them back
in my result, at least not as printable characters. Instead, the following
happens, and utterly confuses me:

#;2> (define s3 (parse letters (string->list s)))
#;3> s3
"Gnsesger"
#;4> (string-length s3)
6
#;5> (string->list s3)
(#\G #\x4bb3 #\e #\s #\x49e5 #\r)
#;6> (list->string (string->list s3))
"G䮳es䧥r"

So, I put the parse result into 's3'. Printing it shows an eight-character
string, namely the one I want minus my beloved umlauts. 'string-length'
reports that string as six characters long, and 'string->list' gives me
exactly six characters, swallowing further ASCII characters of my string.
Converting that list back with 'list->string' then produces Chinese
characters ... even though '(list->string (string->list s1))', with my pure
ASCII string, round-trips without fault.

I guess I have some problems understanding some utf8 concepts?!

/Christoph

On Mon, Feb 17, 2020 at 3:38 PM  wrote:

> Christoph Lange wrote:
> > meaning, that the ä isn't recognized as being a letter within the
> > 'char-set:letter'.
>
> The utf8 egg’s srfi-14 character sets are designed to be compatible with
> the original srfi-14 and only contain ASCII characters, as stated in the
> documentation:
> https://wiki.call-cc.org/eggref/5/utf8#unicode-char-sets
> “The default SRFI-14 char-sets are defined using ASCII-only characters”
>
> You might want to import the unicode-char-sets module, and use one of its
> sets, like char-set:alphabetic.
>
> I hope this helps. :)
>

-- 
Christoph Lange
Lotsarnas Väg 8
430 83 Vrångö

Re: German Umlauts / UTF8 with comparse

2020-02-17 Thread kooda

Christoph Lange wrote:
> meaning, that the ä isn't recognized as being a letter within the
> 'char-set:letter'.

The utf8 egg’s srfi-14 character sets are designed to be compatible with the 
original srfi-14 and only contain ASCII characters, as stated in the 
documentation:
https://wiki.call-cc.org/eggref/5/utf8#unicode-char-sets
“The default SRFI-14 char-sets are defined using ASCII-only characters”

You might want to import the unicode-char-sets module, and use one of its
sets, like char-set:alphabetic.

I hope this helps. :)

Re: German Umlauts / UTF8 with comparse

2020-02-17 Thread Moritz Heidkamp

Hi Christoph,

On 17 February 2020 14:31 +01, Christoph Lange wrote:

> meaning, that the ä isn't recognized as being a letter within the
> 'char-set:letter'. (The UTF8 aspect of correct character width works on the
> other hand: in the remaining string, the ä is represented by only one #\.
> If I don't use the UTF8 string equivalents by importing 'utf8', it would be
> two.)

this is because char-set:letter is not redefined by `utf8-srfi-13`. You
can import `unicode-char-sets` which should give you what you need. See
http://wiki.call-cc.org/eggref/5/utf8#unicode-char-sets for a list of
char-sets supported by that module.
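
As a rough sketch of that suggestion, the definitions from the original
question could be adapted as follows. This assumes that the
`unicode-char-sets` module exports `char-set:alphabetic`, as stated in this
thread, and leaves everything else as originally posted. It only addresses
which characters the parser accepts; it makes no claim about how the
resulting string is printed.

```scheme
;; Sketch only: the definitions from the original question, with the
;; ASCII-only char-set:letter swapped for a Unicode-aware set.
;; Assumption: unicode-char-sets (part of the utf8 egg) exports
;; char-set:alphabetic, as the replies in this thread state.
(import comparse utf8 utf8-srfi-14 unicode-char-sets)

;; Build a parser that accepts a single character contained in cs.
(define (utf8-in cs)
  (satisfies (lambda (c) (char-set-contains? cs c))))

;; A single letter, now including non-ASCII letters such as ä.
(define letter
  (utf8-in char-set:alphabetic))

;; A run of letters, collected into a string (unchanged from the question).
(define letters
  (as-string (repeated letter 1 20)))
```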

Hope that helps!

Moritz

German Umlauts / UTF8 with comparse

2020-02-17 Thread Christoph Lange

I read older threads about parsing Japanese with comparse and took some
ideas from there, but am still stuck:


(import comparse utf8 utf8-srfi-14)

(define s "Gänsesäger 2,1")
(define s1 "Rotkehlchen 1,0")

(define (utf8-in cs)
  (satisfies (lambda (c) (char-set-contains? cs c))))

(define letter
  (utf8-in char-set:letter))

(define letters
  (as-string (repeated letter 1 20)))



This is what I have, and the 'word' at the beginning of s1 is
parsed completely and correctly with the 'letters' parser:

#;1> (parse letters (string->list s1))
"Rotkehlchen"
#
; 2 values


For 's' though I get this:


#;2> (parse letters (string->list s))
"G"
#
; 2 values



meaning, that the ä isn't recognized as being a letter within the
'char-set:letter'. (The UTF8 aspect of correct character width works on the
other hand: in the remaining string, the ä is represented by only one #\.
If I don't use the UTF8 string equivalents by importing 'utf8', it would be
two.)

Any hint for me?

/Christoph

-- 
Christoph Lange
Lotsarnas Väg 8
430 83 Vrångö
