[julia-users] Re: question on utf8 strings

Steven G. Johnson Mon, 12 Jan 2015 08:52:58 -0800


On Sunday, January 11, 2015 at 4:38:29 PM UTC-5, William Macready wrote:
>
> I've been using Julia for about a month now, and I'm really enjoying the 
> language. My thanks to all who've contributed to its development!
>
> I'm developing a parser for first-order logic, and wanted to use the logic 
> symbols available in unicode. I've come across behaviour that I don't 
> understand.
>
> In the REPL I define the string
> s = "¬(a<b)" with the unicode negation symbol (obtained from \neg<tab>) as 
> the first element
>
> As I expected s[1] returns '¬', but s[2] returns the error
> ERROR: invalid UTF-8 character index
>   in next at ./utf8.jl:68
>   in getindex at string.jl:57
> Then s[3]='(' which I would have thought was at position 2. Similarly, 
> length(s)=6, but s[6]='b'.
>
>
For efficiency, the index in a UTF8String is a byte index, not a character 
index.   This is because UTF-8 is a variable-length encoding: each 
character (codepoint) can occupy from 1 to 4 bytes.  In particular, the 
codepoint '¬' uses a two-byte encoding, so the next codepoint in the string 
starts at s[3].
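
As a rough illustration of the byte layout (a sketch only, assuming a
Julia version whose string literals are UTF-8 encoded and in which
isvalid(s, i) is available, as it is in Julia 1.x), you can compare the
byte count with the character count and ask which indices start a
codepoint:

```julia
s = "¬(a<b)"

nbytes = sizeof(s)    # number of bytes in the encoded string: 7
nchars = length(s)    # number of characters (codepoints): 6

first_char  = s[1]    # '¬', stored in bytes 1 and 2
second_char = s[3]    # '(', the next codepoint starts at byte 3

# Byte 2 is the continuation byte of '¬', so it is not a valid index.
# (Assumption: isvalid(s, i) exists in the Julia version in use.)
valid = [isvalid(s, i) for i in 1:nbytes]
```

Only index 2 comes back as invalid here, which is why s[2] raises an
error while s[1] and s[3] work.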


For sequential string processing, you can use the nextind and prevind 
functions to find the next/previous valid indices in the string.  For 
example, nextind(s,1) in your example above yields 3, and s[3] gives '('.  In 
practice, virtually all string processing is sequential (starting at the 
beginning of the string or at previously computed indices), so UTF-8 string 
processing is efficient.
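
A minimal sketch of such a sequential scan follows.  The function name
walk is made up for this example, and the loop assumes that sizeof(s) is
the last byte index of the string:

```julia
# Hypothetical helper (not part of Julia): collect (byte index, character)
# pairs by stepping through the string with nextind.
function walk(s)
    out = Any[]
    i = 1                       # byte index of the first character
    while i <= sizeof(s)        # sizeof(s) is the number of bytes in s
        push!(out, (i, s[i]))   # s[i] is valid because i starts a codepoint
        i = nextind(s, i)       # jump to the start of the next codepoint
    end
    return out
end

s = "¬(a<b)"
visited = walk(s)      # visits byte indices 1, 3, 4, 5, 6, 7
back = prevind(s, 3)   # 1: stepping backwards from '(' lands on '¬'
```

Each step does a constant amount of work, so the whole scan is linear
in the length of the string.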

Alternatively, you can use the chr2ind function to convert a character 
index into a byte index.  e.g. chr2ind(s,2) gives the byte index of the 
start of the second codepoint in s, which in your case gives 3.  However, 
in practice this is rarely needed, which is good because it is relatively 
slow (it requires Julia to loop through the string).  (The only time I've 
needed it was to convert indices from one encoding to another.)
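
To see why that conversion has to loop, here is a hand-written sketch of
the same idea built on nextind.  The name char_to_byte_index is
hypothetical and this is not the actual chr2ind implementation; it also
does no bounds checking:

```julia
# Hypothetical helper (not Julia's chr2ind): find the byte index at which
# the n-th character of s starts by walking the string from the beginning.
function char_to_byte_index(s, n)
    i = 1                    # byte index of character 1
    for k in 2:n             # advance n-1 characters
        i = nextind(s, i)
    end
    return i
end

s = "¬(a<b)"
second = char_to_byte_index(s, 2)   # 3
sixth  = char_to_byte_index(s, 6)   # 7, the closing ')'
```

The cost grows with n, which is the reason to prefer carrying byte
indices around instead of converting from character indices repeatedly.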

Regex matching returns the byte index, which is what you want: that lets 
you efficiently jump to that point in the string.  That is why 
match(r"a",s).offset 
returns 4: this is correct, because the character 'a' indeed starts at the 4th 
byte of s and s[4] == 'a'.   
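
As a small sketch of how the byte offset can be used directly (the
variable names are just for illustration), you can index with it and
continue matching from the following codepoint:

```julia
s = "¬(a<b)"

m = match(r"a", s)
off = m.offset              # 4: byte index where the match starts
ch = s[off]                 # 'a'

# Continue searching after the match; the third argument of match is
# the byte index at which to start.
next_start = nextind(s, off)       # 5
m2 = match(r"b", s, next_start)
off2 = m2.offset                   # 6, and s[6] == 'b'
```

No conversion between character and byte indices is needed at any
point.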

See also http://docs.julialang.org/en/latest/manual/strings/#unicode-and-utf-8

There is some discussion of using a special string indexing type to hide 
this complexity, but it raises some subtle tradeoffs and nothing has been 
decided yet: https://github.com/JuliaLang/julia/issues/9297
