On Sunday, January 11, 2015 at 4:38:29 PM UTC-5, William Macready wrote:
>
> I've been using Julia for about a month now, and I'm really enjoying the
> language. My thanks to all who've contributed to its development!
>
> I'm developing a parser for first-order logic, and wanted to use the logic
> symbols available in unicode. I've come across behaviour that I don't
> understand.
>
> In the REPL I define the string
> s = "¬(a<b)" with the unicode negation symbol (obtained from \neg<tab>) as
> the first element
>
> As I expected s[1] returns '¬', but s[2] returns the error
> ERROR: invalid UTF-8 character index
> in next at ./utf8.jl:68
> in getindex at string.jl:57
> Then s[3]='(' which I would have thought was at position 2. Similarly,
> length(s)=6, but s[6]='b'.
>
>
For efficiency, the index in a UTF8String is a byte index, not a character
index. This is because UTF-8 is a variable-length encoding: each
character (codepoint) can occupy from 1 to 4 bytes. In particular, the
codepoint '¬' uses a two-byte encoding, so the next codepoint in the string
starts at s[3].
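
As a rough illustration of the byte layout (a sketch only, assuming a Julia version in which sizeof(s) gives the number of bytes in a string and isvalid(s, i) tells you whether i is the start of a character):

```julia
s = "¬(a<b)"

n_chars = length(s)   # number of characters (codepoints) in s
n_bytes = sizeof(s)   # number of bytes in the UTF-8 encoding of s

# '¬' occupies bytes 1 and 2, so index 2 falls in the middle of a character.
valid_1 = isvalid(s, 1)   # start of '¬'
valid_2 = isvalid(s, 2)   # second byte of '¬'
valid_3 = isvalid(s, 3)   # start of '('
```

Here n_chars is 6 but n_bytes is 7, which is why s[6] is 'b' rather than the final ')'.
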
For sequential string processing, you can use the nextind and prevind
functions to find the next/previous valid indices in the string. For example,
nextind(s,1) in your example above yields 3, and s[3] gives '('. In
practice, virtually all string processing is sequential (starting at the
beginning of the string or at previously computed indices), so UTF-8 string
processing is efficient.
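
A minimal sketch of that sequential pattern follows. The helper name char_start_indices is made up for this example and is not part of Julia; it assumes sizeof(s) gives the byte length of the string.

```julia
# Hypothetical helper (not a Base function): collect the byte index at
# which each character of s starts, stepping forward with nextind.
function char_start_indices(s)
    indices = Int[]
    i = 1
    while i <= sizeof(s)
        push!(indices, i)
        i = nextind(s, i)   # jump to the start of the next character
    end
    return indices
end

s = "¬(a<b)"
idx = char_start_indices(s)
second_char = s[idx[2]]     # the character after '¬'
```

For this string idx is [1, 3, 4, 5, 6, 7], and second_char is '('. Each step costs only the work of skipping one character, so walking the whole string is linear in its length.
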
Alternatively, you can use the chr2ind function to convert a character
index into a byte index. For example, chr2ind(s,2) gives the byte index of the
start of the second codepoint in s, which in your case is 3. However,
in practice this is rarely needed, which is good because it is relatively
slow (it requires Julia to loop through the string). (The only time I've
needed it was to convert indices from one encoding to another.)
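
To see why such a conversion has to loop through the string, here is a hand-written sketch of the same idea. The function char_to_byte_index is a hypothetical stand-in written for illustration, not the actual implementation of chr2ind.

```julia
# Hypothetical illustration (not the library's implementation): find the
# byte index of the n-th character by stepping over the first n-1 characters.
function char_to_byte_index(s, n)
    i = 1
    for _ in 2:n
        i = nextind(s, i)   # one step per preceding character
    end
    return i
end

s = "¬(a<b)"
second = char_to_byte_index(s, 2)   # byte index of '('
last_i = char_to_byte_index(s, 6)   # byte index of ')'
```

Here second is 3 and last_i is 7. The cost grows with n, so calling this repeatedly inside a loop over the string would make the loop quadratic, whereas stepping with nextind keeps it linear.
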
Regex matching returns the byte index, which is what you want: that lets
you efficiently jump to that point in the string. That is why
match(r"a",s).offset
returns 4: this is correct, because the character 'a' indeed starts at the 4th
byte of s and s[4] == 'a'.
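
A short sketch of using the match offset directly as an index into your example string (the variable names are only for illustration):

```julia
s = "¬(a<b)"

m = match(r"a", s)
off = m.offset          # byte index at which the match starts
ch = s[off]             # the character at that byte index
rest = s[off:end]       # the tail of the string from the match onward
```

Here off is 4, ch is 'a', and rest is "a<b)". No conversion between character and byte indices is needed at any point.
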
See also http://docs.julialang.org/en/latest/manual/strings/#unicode-and-utf-8
There is some discussion of using a special string indexing type to hide
this complexity, but it raises some subtle tradeoffs and nothing has been
decided yet: https://github.com/JuliaLang/julia/issues/9297