Re: [racket] regexp operations on character input ports returning bytes

Matthew Flatt Sat, 25 Dec 2010 07:46:03 -0800

At Sat, 25 Dec 2010 10:23:54 -0500, Neil Van Dyke wrote:
> When doing a regexp on a character input port, what's the best way to 
> get string results out instead of bytes results?


Decode the results of `regexp-match' using `bytes->string/utf-8'.
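
A minimal sketch of that decoding step, in case it helps. The helper name `regexp-match/strings' is hypothetical (it is not part of Racket); it only wraps `regexp-match' and `bytes->string/utf-8':

```racket
#lang racket/base

;; Hypothetical helper: like `regexp-match` on an input port, but each
;; matched group is decoded from UTF-8 bytes into a string.
(define (regexp-match/strings rx in)
  (define m (regexp-match rx in))
  (and m
       (for/list ([b (in-list m)])
         ;; Groups that did not participate in the match are #f;
         ;; pass those through unchanged.
         (and b (bytes->string/utf-8 b)))))

(define in (open-input-string "name: λ-calculus\n"))
(define result (regexp-match/strings #rx"name: ([^\n]*)" in))
;; result is '("name: λ-calculus" "λ-calculus")
```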

> For example, this is documented behavior, but not actually what I want, 
> because I don't want to have to re-encode the bytes as a string (plus, I 
> would have to query the input port to find out what its character 
> encoding is, if I don't know it a priori):

A string regexp applied to an input port is defined to match against
the UTF-8 decoding of the port's bytes, so you can always decode the
results as UTF-8.

If some layer of the input has a different encoding, it's handled by
conversion to a UTF-8 encoding at the port level.
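
As an illustration of port-level conversion, here is a sketch using `reencode-input-port' from `racket/port'. The choice of Latin-1 input and the encoding name "ISO-8859-1" are assumptions for the example; which encoding names are accepted depends on the platform's iconv support:

```racket
#lang racket/base
(require racket/port)

;; "café" encoded as Latin-1: é is the single byte #xE9, which is not
;; valid UTF-8 on its own.
(define latin-1-bytes (bytes 99 97 102 #xE9))

;; Wrap the raw port so that reading from `in` yields UTF-8 bytes.
;; The encoding name is an assumption; availability depends on iconv.
(define in
  (reencode-input-port (open-input-bytes latin-1-bytes) "ISO-8859-1"))

;; The string regexp now sees UTF-8, so the result decodes as UTF-8.
(define m (regexp-match #rx"caf." in))
(define decoded (and m (bytes->string/utf-8 (car m))))
```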

> do "regexp-match-peek-positions" as a peek and then use "read-string" 

That doesn't work, because the positions are byte offsets, so they
don't tell you how many characters to read.
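
A small sketch of the mismatch, using a made-up input string. It also shows one possible workaround under the assumption that the input is valid UTF-8: read that many bytes with `read-bytes' and decode them afterwards:

```racket
#lang racket/base

(define text "aλx!")   ; λ takes two bytes in UTF-8

;; Positions reported for a port are byte offsets, not character offsets.
(define positions
  (regexp-match-peek-positions #rx"λx" (open-input-string text)))
;; positions is '((1 . 4)): the match ends after 4 bytes,
;; but that span covers only 3 characters ("aλx").

;; Reading by bytes and decoding afterwards avoids the mismatch.
(define in (open-input-string text))
(define end (cdar (regexp-match-peek-positions #rx"λx" in)))
(define prefix (bytes->string/utf-8 (read-bytes end in)))
;; prefix is "aλx"; calling (read-string end in) instead would have
;; consumed 4 characters, one more than intended.
```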

> Is there a better way using regexp operations on input ports?

No. Decoding bytes to a string using UTF-8 has to happen at some level,
so there are not really any efficiency or generality issues in
performing the decoding on the result of `regexp-match'.

