bvosburgh-tq opened a new issue, #1551:
URL: https://github.com/apache/jena/issues/1551

   ### Version
   
   4.6.1
   
   ### What happened?
   
   When the Turtle parser encounters a tab (`'\t'`) in a URI/IRI, the character is treated differently from other "bad"/"illegal" characters.
   
   When using a Turtle parser configured for "lax" handling of invalid URIs, like this:
   ```
   RDFParser.create()
        .source(in)
        .lang(Lang.TURTLE)
        .resolveURIs(false)
        .errorHandler(ErrorHandlerFactory.errorHandlerWarning(null))
        .parse(model);
   ```
   and the parser encounters a tab in a URI, the result is a `NullPointerException` in later parser processing:
   ```
   java.lang.NullPointerException: Cannot invoke "String.startsWith(String)" because "iri" is null
        at org.apache.jena.riot.system.RiotLib.isBNodeIRI(RiotLib.java:107)
        at org.apache.jena.riot.system.ParserProfileStd.createURI(ParserProfileStd.java:185)
        at org.apache.jena.riot.system.ParserProfileStd.create(ParserProfileStd.java:259)
        at org.apache.jena.riot.lang.LangTurtleBase.tokenAsNode(LangTurtleBase.java:577)
        at org.apache.jena.riot.lang.LangTurtleBase.node(LangTurtleBase.java:410)
        at org.apache.jena.riot.lang.LangTurtleBase.triplesNode(LangTurtleBase.java:445)
        at org.apache.jena.riot.lang.LangTurtleBase.objectList(LangTurtleBase.java:419)
        at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectItem(LangTurtleBase.java:352)
        at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectList(LangTurtleBase.java:333)
        at org.apache.jena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:314)
        at org.apache.jena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:178)
        at org.apache.jena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:46)
        at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:79)
        at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
   ```
   Other problematic characters (e.g. `'{'`, `'}'`, `'"'`) are handled more gracefully: they generate a call to `ErrorHandler.warning(...)` and, if the error handler does not throw an exception (as in the "lax" case above), the parser leaves the character in the URI and continues processing.
   
   It seems tabs should be handled the same way.
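
To make the requested behaviour concrete, here is a minimal, self-contained sketch of what "handled the same way" could look like. It is not Jena code: the class `LaxIriCheck`, the method `check`, and the `Consumer<String>` warning callback are hypothetical stand-ins for the tokenizer's character check and `ErrorHandler.warning(...)`, and the set of flagged characters is just the examples named above plus tab.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for the parser's IRI character check; not part of Jena.
final class LaxIriCheck {

    // Characters flagged in this sketch: the examples from the report, plus tab.
    private static final String FLAGGED = "{}\"\t";

    /**
     * Reports every flagged character to the warning callback and returns the
     * IRI unchanged. If the callback throws (the "strict" case), the exception
     * propagates to the caller; otherwise processing continues.
     */
    static String check(String iri, Consumer<String> warning) {
        for (int i = 0; i < iri.length(); i++) {
            char ch = iri.charAt(i);
            if (FLAGGED.indexOf(ch) >= 0) {
                warning.accept(String.format(
                        "Bad character in IRI at index %d: U+%04X", i, (int) ch));
            }
        }
        // The important property: a tab never turns the result into null.
        return iri;
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();
        String iri = "http://example.org/a\tb";

        // "Lax" handling: warnings are collected, nothing is thrown.
        String result = check(iri, warnings::add);

        System.out.println("unchanged: " + result.equals(iri));
        warnings.forEach(System.out::println);
    }
}
```

With a callback that only records warnings, the tab is reported and the IRI string is passed on intact, so later stages never see a `null`. With a callback that throws, the same input fails immediately, matching how the other illegal characters are described as behaving.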
   
   ### Relevant output and stacktrace
   
   _No response_
   
   ### Are you interested in making a pull request?
   
   Yes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

