bvosburgh-tq opened a new issue, #1551:
URL: https://github.com/apache/jena/issues/1551
### Version
4.6.1
### What happened?
When the Turtle parser encounters a tab (`'\t'`) in a URI/IRI, it handles the
character differently from other "bad"/"illegal" characters.
When using a Turtle parser configured for "lax" handling of invalid URIs,
like this:
```
RDFParser.create()
    .source(in)
    .lang(Lang.TURTLE)
    .resolveURIs(false)
    .errorHandler(ErrorHandlerFactory.errorHandlerWarning(null))
    .parse(model);
```
and the parser encounters a tab in a URI, the result is a
`NullPointerException` in later parser processing:
```
java.lang.NullPointerException: Cannot invoke "String.startsWith(String)" because "iri" is null
    at org.apache.jena.riot.system.RiotLib.isBNodeIRI(RiotLib.java:107)
    at org.apache.jena.riot.system.ParserProfileStd.createURI(ParserProfileStd.java:185)
    at org.apache.jena.riot.system.ParserProfileStd.create(ParserProfileStd.java:259)
    at org.apache.jena.riot.lang.LangTurtleBase.tokenAsNode(LangTurtleBase.java:577)
    at org.apache.jena.riot.lang.LangTurtleBase.node(LangTurtleBase.java:410)
    at org.apache.jena.riot.lang.LangTurtleBase.triplesNode(LangTurtleBase.java:445)
    at org.apache.jena.riot.lang.LangTurtleBase.objectList(LangTurtleBase.java:419)
    at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectItem(LangTurtleBase.java:352)
    at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectList(LangTurtleBase.java:333)
    at org.apache.jena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:314)
    at org.apache.jena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:178)
    at org.apache.jena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:46)
    at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:79)
    at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
```
Other problematic characters (e.g. `'{'`, `'}'`, `'"'`) are handled more
gracefully: They generate a call to `ErrorHandler.warning(...)` and, if the
error handler does not throw an exception (as in the "lax" case, above), the
parser leaves the character in the URI and continues processing.
It seems tabs should be handled the same way: a warning, with the tab left in
the URI, instead of a `NullPointerException`.
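To make the expected behavior concrete, here is a minimal, self-contained sketch of the "warn and keep going" handling described above. It is not Jena code: `LaxIriSketch`, `CollectingHandler`, `checkIri`, and the `SUSPECT` character set are all hypothetical names made up for illustration, and the real check in the parser may look quite different. The point is only that a tab takes the same path as `'{'`, `'}'`, and `'"'`, and that the caller always gets a non-null string back.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Jena.
class LaxIriSketch {

    /** Stand-in for a "lax" error handler: records warnings, never throws. */
    static final class CollectingHandler {
        final List<String> warnings = new ArrayList<>();

        void warning(String message) {
            warnings.add(message);
        }
    }

    // Assumed, illustrative set: the characters named in this issue, plus tab.
    private static final String SUSPECT = "\t{}\"";

    /**
     * Reports each suspect character to the handler, then returns the IRI
     * string unchanged. The result is never null, so later processing that
     * calls methods on it cannot hit a NullPointerException.
     */
    static String checkIri(String iri, CollectingHandler handler) {
        for (int i = 0; i < iri.length(); i++) {
            char ch = iri.charAt(i);
            if (SUSPECT.indexOf(ch) >= 0) {
                handler.warning(String.format(
                        "Illegal character in IRI (codepoint 0x%02X) at index %d", (int) ch, i));
            }
        }
        return iri;
    }

    public static void main(String[] args) {
        CollectingHandler handler = new CollectingHandler();
        String withBrace = checkIri("http://example.org/a{b", handler);
        String withTab = checkIri("http://example.org/a\tb", handler);

        // One warning per IRI; both strings come back with the character intact.
        System.out.println(handler.warnings.size());      // 2
        System.out.println(withBrace.contains("{"));      // true
        System.out.println(withTab.contains("\t"));       // true
    }
}
```

With a handler that throws on warnings, the same check would stop at the first bad character, which matches how the strict configuration already treats the other characters.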
### Relevant output and stacktrace
_No response_
### Are you interested in making a pull request?
Yes