afs commented on issue #2102:
URL: https://github.com/apache/jena/issues/2102#issuecomment-1825417655

   Side note: Parsing in SPARQL and parsing in Turtle are significantly 
different in the way dubious (error or warning) IRIs are treated.
   
   The W3C specs define the IRI token as 
   
   ```
   IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
   ```
   and then expect further checking for the legality of the string that matches 
that rule.
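
As a rough illustration (a minimal Python sketch of the grammar rule, not Jena's code), the token rule can be written as a regular expression. The names `IRIREF`, `UCHAR` and `is_iriref_token` are made up for this example. It only answers "does this look like an IRI token?"; whether the string is a legal IRI is a separate, later check.

```python
import re

# UCHAR ::= '\u' HEX HEX HEX HEX | '\U' HEX HEX HEX HEX HEX HEX HEX HEX
UCHAR = r"\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}"

# IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
IRIREF = re.compile(r'<(?:[^\x00-\x20<>"{}|^`\\]|' + UCHAR + r')*>')


def is_iriref_token(text):
    """True if the whole string matches the IRIREF token rule.

    This is a token-level test only; it does not validate the IRI.
    """
    return IRIREF.fullmatch(text) is not None


print(is_iriref_token("<http://example/a>"))       # ordinary IRI
print(is_iriref_token("<http://example/{x}>"))     # raw '{' is excluded by the rule
print(is_iriref_token(r"<http://example/\u007B>")) # '{' written as a UCHAR escape
print(is_iriref_token("<::::>"))                   # a token, but not a sensible IRI
```

The last case is the reason for the "further checking": the token rule accepts strings that IRI validation should still flag.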
   
   Jena's SPARQL parser, ARQ, uses that rule (via javacc) then performs IRI 
validation.
   
   Jena's Turtle parser uses a custom tokenizer and does more limited checking 
on the characters between `<` and `>`, then performs IRI validation. Because 
the Turtle tokenizer is custom, the messages are more human-meaningful.
   
   Any IRI validation has to parse the string so it duplicates the character 
exclusion rules of `IRIREF`.
   
   This is all known and intended by the W3C working groups - both specs 
intentionally did not include the full RFC 3986/3987 grammar. It is quite large 
and it would have to be modified for `UCHAR`. `UCHAR` escapes mean later checks 
are necessary anyway. It does not fit well with a standard parser/tokenizer 
split.
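
To make the `UCHAR` point concrete, here is a small Python sketch (again not Jena's implementation; `decode_uchar` and `forbidden_after_decoding` are hypothetical helper names). An escaped character passes the token rule, so the excluded characters can only be detected after the escapes have been decoded. How a parser reports them (error or warning) is a separate policy decision that this sketch does not model.

```python
import re

# \uXXXX or \UXXXXXXXX, as in the UCHAR rule.
UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

# The characters excluded by the IRIREF rule.
FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def decode_uchar(body):
    """Replace each UCHAR escape with the character it denotes.

    Assumes every escape names a valid code point; chr() raises
    ValueError for values above 0x10FFFF.
    """
    return UCHAR.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), body)


def forbidden_after_decoding(body):
    """List the excluded characters present once escapes are decoded."""
    return FORBIDDEN.findall(decode_uchar(body))


# Text between '<' and '>' that is fine at the token level...
body = r"http://example/\u007Bx\u007D"
print(decode_uchar(body))              # http://example/{x}
print(forbidden_after_decoding(body))  # ['{', '}']
```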
   
   One effect is that a raw `{` (as opposed to writing it as a `UCHAR` escape) 
is illegal surface syntax in SPARQL (an error that stops the parser) but only 
a warning in Turtle.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

