On 02/10/11 00:25, Tim Harsch wrote:
I was trying to send the SPARQL DAWG test
"i18n/normalization-01.ttl"
through RiotReader.createParserTurtle

when I call parse I get
Caught: org.openjena.riot.RiotException: [line: 19, col: 12] Unknown char: ?(769)

(this message is UTF-8 - how it looks will depend on your email client - not all of them get it right)

Yes, RIOT is UTF-8 aware.

If you look at normalization-01.ttl, you'll see it says:

[] foaf:name "Alice" ;
  HR:resumé "Alice's normalized resumé"  .

[] foaf:name "Bob" ;
  HR:resumé "Bob's non-normalized resumé" .  <<--- This is line 19

Note that the second "resumé" is an "e" followed by an accent as a combining character (U+0301, which is decimal 769, the character reported in the error message), i.e. two characters, not one.

RIOT isn't accepting combining characters correctly; it should.

Smaller test data:

----
1-é
2-é
----
and od -t x1:

31 2d c3 a9 0a
32 2d 65 cc 81 0a

Emacs handles it correctly; Thunderbird 3.1.15 and Eclipse do not. They put the accent after the character, not over it, i.e. they treat it as two characters. They are not performing Unicode normalization (it's not part of UTF-8).
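To make the normalization point concrete, here is a minimal sketch using the standard java.text.Normalizer class. The class name NormalizationDemo and the helper nfc are made up for illustration; this is not something RIOT does, just a way to see that the two spellings only become equal after explicit normalization.

```java
import java.text.Normalizer;

class NormalizationDemo {
    // Hypothetical helper: normalize to NFC (composed form).
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "\u00e9";   // é as one code point
        String decomposed  = "e\u0301";  // e + COMBINING ACUTE ACCENT

        // Decoding UTF-8 does not normalize, so the strings differ.
        System.out.println(precomposed.equals(decomposed));      // false
        // After NFC normalization they compare equal.
        System.out.println(nfc(decomposed).equals(precomposed)); // true
    }
}
```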

Java passes back:

x0031 1
x002d -
x00e9 é
x000a

x0032 2
x002d -
x0065 e
x0301 ́
x000a

i.e. two chars.
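For anyone who wants to reproduce that listing, a small sketch along these lines should do it. The class name CharDump and the method dump are hypothetical; the byte arrays are copied from the od output above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class CharDump {
    // Format each UTF-16 char of the input as "xNNNN".
    static List<String> dump(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(String.format("x%04x", (int) s.charAt(i)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Precomposed é is c3 a9; e + combining acute is 65 cc 81.
        byte[] line1 = {0x31, 0x2d, (byte) 0xc3, (byte) 0xa9, 0x0a};
        byte[] line2 = {0x32, 0x2d, 0x65, (byte) 0xcc, (byte) 0x81, 0x0a};
        System.out.println(dump(new String(line1, StandardCharsets.UTF_8)));
        System.out.println(dump(new String(line2, StandardCharsets.UTF_8)));
    }
}
```

The first line decodes to four chars and the second to five, because the UTF-8 decoder passes the combining accent through as its own char.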

The other Turtle reader in Jena (the old one) gets this right because combining characters are in the grammar production for "nameChar" (characters after the first in a prefix name part). But it also means you can write utter junk, like the digit one followed by a combining character. RIOT ought to do the same.

http://www.w3.org/TeamSubmission/turtle/#nameChar
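As a rough sketch of what that production implies (this is not the code in RIOT or in the old reader, and the class and method names are made up), the character classes from the Team Submission grammar can be written as below. The combining range U+0300 to U+036F is allowed by nameChar but not by nameStartChar.

```java
class TurtleNameChars {
    static boolean inRange(int cp, int lo, int hi) {
        return cp >= lo && cp <= hi;
    }

    // nameStartChar: first character of a name.
    static boolean isNameStartChar(int cp) {
        return inRange(cp, 'A', 'Z') || cp == '_' || inRange(cp, 'a', 'z')
            || inRange(cp, 0x00C0, 0x00D6) || inRange(cp, 0x00D8, 0x00F6)
            || inRange(cp, 0x00F8, 0x02FF) || inRange(cp, 0x0370, 0x037D)
            || inRange(cp, 0x037F, 0x1FFF) || inRange(cp, 0x200C, 0x200D)
            || inRange(cp, 0x2070, 0x218F) || inRange(cp, 0x2C00, 0x2FEF)
            || inRange(cp, 0x3001, 0xD7FF) || inRange(cp, 0xF900, 0xFDCF)
            || inRange(cp, 0xFDF0, 0xFFFD) || inRange(cp, 0x10000, 0xEFFFF);
    }

    // nameChar: any later character; adds '-', digits, U+00B7,
    // the combining marks U+0300..U+036F, and U+203F..U+2040.
    static boolean isNameChar(int cp) {
        return isNameStartChar(cp) || cp == '-' || inRange(cp, '0', '9')
            || cp == 0x00B7 || inRange(cp, 0x0300, 0x036F)
            || inRange(cp, 0x203F, 0x2040);
    }

    // name ::= nameStartChar nameChar*
    static boolean isName(String s) {
        if (s.isEmpty()) return false;
        int first = s.codePointAt(0);
        if (!isNameStartChar(first)) return false;
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isNameChar(cp)) return false;
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isName("resume\u0301")); // decomposed form from the test
        System.out.println(isName("a1\u0301"));     // digit + combining char, also accepted
    }
}
```

Both calls return true under this grammar, which is the "utter junk" case: the check is purely per character, so a combining mark after a digit is accepted.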

Thanks for the bug report,

        Andy
