Re: RIOT UTF-8 aware?

Andy Seaborne Sun, 02 Oct 2011 02:43:12 -0700

On 02/10/11 00:25, Tim Harsch wrote:

I was trying to send the SPARQL DAWG test
"i18n/normalization-01.ttl"
through RiotReader.createParserTurtle


when I call parse I get
Caught: org.openjena.riot.RiotException: [line: 19, col: 12] Unknown char: ?(769)

(This message is UTF-8. How it looks will depend on your email client; not all of them get it right.)


Yes, RIOT is UTF-8 aware.

If you look at normalization-01.ttl, you'll see it says:

[] foaf:name "Alice" ;
  HR:resumé "Alice's normalized resumé"  .

[] foaf:name "Bob" ;
  HR:resumé "Bob's non-normalized resumé" .  <<--- This is line 19

Note that the second "resumé" is an "e" followed by an accent as a combining character, i.e. 2 characters, not one.


RIOT isn't accepting combining characters correctly; it should.

Smaller test data:

----
1-é
2-é
----
and od -t x1:

31 2d c3 a9 0a
32 2d 65 cc 81 0a

Emacs handles it correctly; Thunderbird 3.1.15 and Eclipse do not. They put the accent after the char, not over it, i.e. they treat it as two characters. They are not performing Unicode normalization (it's not part of UTF-8).
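To illustrate that normalization is a separate step from UTF-8 decoding, here is a minimal sketch using the JDK's java.text.Normalizer. The class name NormalizeDemo and the helper toNfc are made up for this example; nothing here is RIOT code, and it is not a suggestion that a parser should normalize its input.

```java
import java.text.Normalizer;

public class NormalizeDemo {
    // Compose base characters and combining marks where a precomposed form exists (NFC).
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "2-" then "e" then U+0301 COMBINING ACUTE ACCENT, as in the second test line.
        String decomposed = "2-e\u0301";
        String composed = toNfc(decomposed);

        System.out.println(decomposed.length());           // 4
        System.out.println(composed.length());             // 3
        System.out.println(composed.equals("2-\u00e9"));   // true
    }
}
```

The decoded string only collapses to the single character x00e9 after an explicit call to the normalizer; decoding alone leaves it as two chars.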


Java passes back:

x0031 1
x002d -
x00e9 é
x000a

x0032 2
x002d -
x0065 e
x0301 ́
x000a

i.e. two chars.
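A listing like the one above can be reproduced with a few lines of plain Java. This is only a sketch: the class name CodeUnitDump and the method codeUnits are invented for illustration, and the byte arrays are copied from the od output earlier in this message.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CodeUnitDump {
    // Decode UTF-8 bytes and return each resulting Java char as "x" plus four hex digits.
    static List<String> codeUnits(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(String.format("x%04x", (int) s.charAt(i)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Bytes taken from the od -t x1 dump of the two test lines.
        byte[] line1 = {0x31, 0x2d, (byte) 0xc3, (byte) 0xa9, 0x0a};
        byte[] line2 = {0x32, 0x2d, 0x65, (byte) 0xcc, (byte) 0x81, 0x0a};

        System.out.println(codeUnits(line1)); // [x0031, x002d, x00e9, x000a]
        System.out.println(codeUnits(line2)); // [x0032, x002d, x0065, x0301, x000a]
    }
}
```

The decoder is doing its job in both cases: c3 a9 becomes the single char x00e9, while 65 cc 81 becomes x0065 followed by x0301 (decimal 769, the value in the error message).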

The other Turtle reader in Jena (the old one) gets this right because combining characters are in the grammar production for "nameChar" (characters after the first in a prefix name part). But it also means you can write utter junk like digit one followed by a combining character. RIOT ought to do the same.


http://www.w3.org/TeamSubmission/turtle/#nameChar
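As a rough sketch of what that production allows, the check below hand-transcribes the nameStartChar and nameChar ranges from the Turtle submission into Java. The class NameCharSketch and its methods are hypothetical names for this example, not code from either Jena parser, and the ranges should be checked against the linked grammar before being relied on.

```java
public class NameCharSketch {
    // Inclusive code point ranges for nameStartChar (transcribed by hand; verify against the spec).
    private static final int[][] NAME_START = {
        {'A', 'Z'}, {'_', '_'}, {'a', 'z'},
        {0x00C0, 0x00D6}, {0x00D8, 0x00F6}, {0x00F8, 0x02FF},
        {0x0370, 0x037D}, {0x037F, 0x1FFF}, {0x200C, 0x200D},
        {0x2070, 0x218F}, {0x2C00, 0x2FEF}, {0x3001, 0xD7FF},
        {0xF900, 0xFDCF}, {0xFDF0, 0xFFFD}, {0x10000, 0xEFFFF}
    };

    // Extra ranges that nameChar adds; U+0300-U+036F is the combining diacritics block.
    private static final int[][] NAME_EXTRA = {
        {'-', '-'}, {'0', '9'}, {0x00B7, 0x00B7},
        {0x0300, 0x036F}, {0x203F, 0x2040}
    };

    private static boolean inRanges(int cp, int[][] ranges) {
        for (int[] r : ranges) {
            if (cp >= r[0] && cp <= r[1]) return true;
        }
        return false;
    }

    static boolean isNameStartChar(int cp) {
        return inRanges(cp, NAME_START);
    }

    static boolean isNameChar(int cp) {
        return isNameStartChar(cp) || inRanges(cp, NAME_EXTRA);
    }

    // A name here is one nameStartChar followed by zero or more nameChars.
    static boolean isName(String s) {
        if (s.isEmpty()) return false;
        int first = s.codePointAt(0);
        if (!isNameStartChar(first)) return false;
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isNameChar(cp)) return false;
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isName("resum\u00e9"));   // precomposed e-acute: true
        System.out.println(isName("resume\u0301"));  // e + combining acute: true
        System.out.println(isName("a1\u0301"));      // digit + combining acute: also true
        System.out.println(isNameStartChar(0x0301)); // a combining mark cannot start a name: false
    }
}
```

The third case is the "utter junk" mentioned above: the grammar works on individual characters, so it cannot tell a sensible base-plus-accent pair from an accent attached to a digit.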

Thanks for the bug report,

        Andy
