On 02/10/11 00:25, Tim Harsch wrote:
I was trying to send the SPARQL DAWG test
"i18n/normalization-01.ttl"
through RiotReader.createParserTurtle
when I call parse I get
Caught: org.openjena.riot.RiotException: [line: 19, col: 12] Unknown char:
?(769)
(this message is UTF-8 - how it looks will depend on your email client -
not all of them get it right)
Yes, RIOT is UTF-8 aware.
If you look at normalization-01.ttl, you'll see it says:
[] foaf:name "Alice" ;
HR:resumé "Alice's normalized resumé" .
[] foaf:name "Bob" ;
HR:resumé "Bob's non-normalized resumé" . <<--- This is line 19
Note that the second "resumé" is an "e" followed by an accent as a combining
character, i.e. 2 characters, not one.
RIOT isn't accepting combining characters correctly - it should do.
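To make the difference between the two spellings concrete, here is a minimal Java sketch. It is not taken from RIOT; the class and constant names are made up for illustration. It uses only java.text.Normalizer from the JDK to show that the two strings differ until one of them is normalized to NFC.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of Jena or RIOT.
class CombiningDemo {
    // Precomposed form: ends in U+00E9 (one char).
    static final String COMPOSED = "resum\u00e9";
    // Decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT (two chars).
    static final String DECOMPOSED = "resume\u0301";

    // Compose base characters and combining marks where Unicode defines a composite.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(COMPOSED.length());                 // 6
        System.out.println(DECOMPOSED.length());               // 7
        System.out.println(COMPOSED.equals(DECOMPOSED));       // false
        System.out.println(COMPOSED.equals(nfc(DECOMPOSED)));  // true
    }
}
```

A parser is not obliged to normalize its input; the point is only that both forms are legal Unicode and both need to be accepted.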
Smaller test data:
----
1-é
2-é
----
and od -t x1:
31 2d c3 a9 0a
32 2d 65 cc 81 0a
Emacs handles it correctly; Thunderbird 3.1.15 and Eclipse do not. They
put the accent after the char, not over it, i.e. they treat it as two characters.
They are not performing Unicode normalization (it's not part of UTF-8).
Java passes back:
x0031 1
x002d -
x00e9 é
x000a
x0032 2
x002d -
x0065 e
x0301 ́
x000a
i.e. two chars.
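A dump like the one above can be reproduced with a few lines of Java. This is only a sketch of one way to do it, not the code actually used; the class and method names are hypothetical. It decodes the bytes shown in the od output as UTF-8 and prints one line per Java char.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Jena or RIOT.
class DumpChars {
    // The bytes from the od output: "1-" + U+00E9 + newline,
    // then "2-" + 'e' + U+0301 + newline.
    static final byte[] DATA = {
        0x31, 0x2d, (byte) 0xc3, (byte) 0xa9, 0x0a,
        0x32, 0x2d, 0x65, (byte) 0xcc, (byte) 0x81, 0x0a
    };

    // Decode UTF-8 bytes and return one "xNNNN" entry per Java char.
    static List<String> dump(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(String.format("x%04x", (int) s.charAt(i)));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String entry : dump(DATA)) {
            System.out.println(entry);
        }
    }
}
```

The first line decodes to three chars before the newline and the second to four, so the UTF-8 decoder is doing its job; the combining accent arrives as its own char, x0301.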
The other Turtle reader in Jena (the old one) gets this right because
combining characters are in the grammar production for "nameChar"
(characters after the first in a prefix name part). But it also means
you can write utter junk like digit one followed by a combining character.
RIOT ought to do the same.
http://www.w3.org/TeamSubmission/turtle/#nameChar
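As a rough illustration of what that production allows, here is a small Java sketch of a nameChar test. It is not RIOT's tokenizer code, the class and method names are invented, and the ranges are transcribed from the nameStartChar and nameChar productions as I read them in the submission, so check them against the grammar before relying on them.

```java
// Hypothetical sketch, not part of Jena or RIOT.
class TurtleNameChar {
    // nameStartChar: letters, '_' and the listed non-ASCII ranges.
    static boolean isNameStartChar(int cp) {
        return (cp >= 'A' && cp <= 'Z') || cp == '_' || (cp >= 'a' && cp <= 'z')
            || (cp >= 0x00C0 && cp <= 0x00D6) || (cp >= 0x00D8 && cp <= 0x00F6)
            || (cp >= 0x00F8 && cp <= 0x02FF) || (cp >= 0x0370 && cp <= 0x037D)
            || (cp >= 0x037F && cp <= 0x1FFF) || (cp >= 0x200C && cp <= 0x200D)
            || (cp >= 0x2070 && cp <= 0x218F) || (cp >= 0x2C00 && cp <= 0x2FEF)
            || (cp >= 0x3001 && cp <= 0xD7FF) || (cp >= 0xF900 && cp <= 0xFDCF)
            || (cp >= 0xFDF0 && cp <= 0xFFFD) || (cp >= 0x10000 && cp <= 0xEFFFF);
    }

    // nameChar: nameStartChar plus '-', digits, U+00B7,
    // the combining marks U+0300..U+036F, and U+203F..U+2040.
    static boolean isNameChar(int cp) {
        return isNameStartChar(cp) || cp == '-' || (cp >= '0' && cp <= '9')
            || cp == 0x00B7
            || (cp >= 0x0300 && cp <= 0x036F)
            || (cp >= 0x203F && cp <= 0x2040);
    }

    // name ::= nameStartChar nameChar*, checked code point by code point.
    static boolean isName(String s) {
        if (s.isEmpty()) return false;
        int first = s.codePointAt(0);
        if (!isNameStartChar(first)) return false;
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isNameChar(cp)) return false;
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

With these ranges, isName("resume\u0301") is true, which is what the test case needs, and so is something like isName("x1\u0301"), which is the junk case: a digit with an accent on it.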
Thanks for the bug report,
Andy