I've been looking at the compatibility of IRI handling, in error and
warning cases. As part of that, I ran Jena 3.17.0 "riot --check" over
wikidata.
The question is what to do about things that are not right, and how much
to tolerate.
Should default behaviour be tightened up?
Which errors and warnings are actually useful?
Are there things that should be warnings that aren't?
(URI, IRI used interchangeably)
The worst cases are IRIs that are illegal by the grammar of RFC 3986,
the RFC that defines the URI syntax. These can still be let through, with
errors/warnings reported.
In 3.17.0, parsing is chatty but enforces little (i.e. makes it a parse
error and stops) beyond passing the language grammar.
The Turtle, N-Triples, etc. parsers accept any string without spaces (and
a few other excluded characters) as an IRI token:
IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
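To illustrate how little the token grammar constrains, the IRIREF rule above can be transcribed into a Python regular expression. This is only a sketch, not Jena's tokenizer; the names IRIREF and is_iriref_token are made up for the example.

```python
import re

# UCHAR in Turtle/N-Triples: \uXXXX or \UXXXXXXXX (hex digits).
UCHAR = r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}'

# IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
IRIREF = re.compile(r'<(?:[^\x00-\x20<>"{}|^`\\]|' + UCHAR + r')*>')

def is_iriref_token(text: str) -> bool:
    """True if text is a complete IRIREF token by the grammar rule."""
    return IRIREF.fullmatch(text) is not None

# Passes the token grammar even though it has two fragments.
print(is_iriref_token('<http://example/a#b#c>'))
# A raw space is excluded by the token grammar.
print(is_iriref_token('<http://www.example/abc def>'))
```

The token rule only excludes a handful of characters, so anything RFC 3986 cares about (fragments, ports, percent-encoding) has to be checked separately.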
Wikidata has also shown that Jena 4.0.0 as it is today is too strict.
Wikidata (5.4B triples) has a few IRIs that are RFC 3986 syntax errors,
listed here as code/count/violation name:
Code:0 29 ILLEGAL_CHARACTER
  Does not pass RFC 3986,
  e.g. two #'s or a port number that is not digits.
Code:5 7 CONTROL_CHARACTER
  e.g. \u009, encoding errors.
Code:30 9 ILLEGAL_PERCENT_ENCODING
  e.g. %2% or %RR
For 5.4B triples, that is really quite good.
There is some use of the "user:password@" form, which RFC 3986 deprecates.
In Wikidata the most common issue is the discouraged, but not illegal, use
of ports:
Code:12 127 PORT_SHOULD_NOT_BE_EMPTY
Code:13 412 DEFAULT_PORT_SHOULD_BE_OMITTED
Code:14 1019 PORT_SHOULD_NOT_BE_WELL_KNOWN
Code:15 0 PORT_SHOULD_NOT_START_IN_ZERO
Some more information:
https://afs.github.io/rdf-iri-syntax.html
Andy
Details, details, ...
== RFC3986 syntax errors
These are syntax errors by RFC 3986:
not just "bad practice", but errors by the grammar.
Two fragments
<irc://freenode.net/##cwm>
<http://example/a#b#c>
(Two "?" is legal)
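A minimal sketch of the two-fragment check, assuming the IRI is given without the surrounding angle brackets. The function name has_second_fragment is hypothetical. It relies on the RFC 3986 rule that "#" may not appear inside the fragment, while "?" may appear inside the query.

```python
def has_second_fragment(iri: str) -> bool:
    """True if a '#' appears inside the fragment, which RFC 3986 forbids."""
    # Everything after the first '#' is the fragment.
    _, _, fragment = iri.partition('#')
    return '#' in fragment

for iri in ('irc://freenode.net/##cwm',
            'http://example/a#b#c',
            'http://example/a?b?c'):
    print(iri, has_second_fragment(iri))
```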
Bad percent encoding; RFC 3986 says "2 hex digits".
<http://example.org/abc%2%F/>
<http://example.org/abc%RR/>
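The "2 hex digits" rule can be checked with a negative lookahead. This is a sketch only; BAD_PERCENT and bad_percent_offsets are names invented for the example.

```python
import re

# A '%' that is not followed by exactly two hex digits.
BAD_PERCENT = re.compile(r'%(?![0-9A-Fa-f]{2})')

def bad_percent_offsets(iri: str) -> list:
    """Offsets of every '%' that does not start a valid percent-encoding."""
    return [m.start() for m in BAD_PERCENT.finditer(iri)]

print(bad_percent_offsets('http://example.org/abc%2%F/'))  # both '%' are bad
print(bad_percent_offsets('http://example.org/abc%RR/'))   # one bad '%'
print(bad_percent_offsets('http://example.org/abc%20/'))   # none
```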
Ports:
<https://:www.example.org/>
":" starts the port, so the host is empty and the port is not digits.
It cannot be a path either, because a path starts at "/".
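The reading of the port example above can be sketched as follows: take the authority (between "//" and the next "/", "?" or "#"), drop any userinfo, and treat the first ":" as the start of the port. The helper names host_and_port and port_is_digits are assumptions for illustration, not any library's API.

```python
import re

# The authority runs from '//' to the next '/', '?' or '#'.
AUTHORITY = re.compile(r'^[^:/?#]+://([^/?#]*)')

def host_and_port(iri: str):
    """Return (host, port); port is None when there is no ':' separator."""
    m = AUTHORITY.match(iri)
    if m is None:
        return None
    hostport = m.group(1).rpartition('@')[2]   # drop userinfo, if any
    if hostport.startswith('['):               # IP literal, e.g. [::1]
        host, _, rest = hostport.partition(']')
        return host + ']', (rest[1:] if rest.startswith(':') else None)
    host, colon, port = hostport.partition(':')
    return host, (port if colon else None)

def port_is_digits(port) -> bool:
    """RFC 3986: port = *DIGIT, so absent or empty ports pass."""
    return port is None or re.fullmatch(r'[0-9]*', port) is not None

host, port = host_and_port('https://:www.example.org/')
print(repr(host), repr(port), port_is_digits(port))
```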
Space:
<http://www.example/abc def>
In IRIs, other "whitespace" above U+00FF is legal.
== Legal by RFC 3986, but not by RFC 7230 (HTTP scheme)
Schemes can add further rules.
<https:///example.org>
<https:///path>
<https://?query>
<https://#frag>
The host is missing.
Legal syntax by RFC 3986.
Illegal by RFC 7230, which defines the http URI scheme.
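A scheme-specific rule like this one can be layered on top of the generic syntax. The sketch below flags http/https IRIs with an empty host; http_host_missing is a hypothetical name, and the scheme/authority split follows the regular expression in RFC 3986 Appendix B.

```python
import re

# Scheme and optional authority, after RFC 3986 Appendix B.
SCHEME_AUTHORITY = re.compile(r'^([^:/?#]+):(?://([^/?#]*))?')

def http_host_missing(iri: str) -> bool:
    """True for http/https IRIs whose host is empty."""
    m = SCHEME_AUTHORITY.match(iri)
    if m is None or m.group(1).lower() not in ('http', 'https'):
        return False                      # rule only applies to http(s)
    authority = m.group(2) or ''
    hostport = authority.rpartition('@')[2]
    if hostport.startswith('['):          # IP literal counts as a host
        return False
    return hostport.partition(':')[0] == ''

for iri in ('https:///example.org', 'https:///path',
            'https://?query', 'https://#frag', 'https://example.org/'):
    print(iri, http_host_missing(iri))
```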
== Legal by RFC 3986 syntax, but not by RFC 8141, latest URN RFC
RFC 8141 makes some things illegal which were legal.
A URN is "urn:NID:NSS".
NID must be 2+ characters, not starting with a digit.
NSS must be 1+ characters.
These are legal by RFC 3986 (general URIs), but not by RFC 8141 (URNs):
<urn:abc:>
<urn:x:test>
That last one may catch people out in test data.
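As a sketch, the two length rules can be checked like this. Only the NID and NSS lengths stated above are tested; the character-level rules of RFC 8141 are deliberately left out, and urn_problems is a name made up for the example.

```python
def urn_problems(iri: str) -> list:
    """Length checks for 'urn:NID:NSS'; returns a list of problem strings."""
    scheme, _, rest = iri.partition(':')
    if scheme.lower() != 'urn':
        return []                         # not a URN, nothing to check
    nid, _, nss = rest.partition(':')
    problems = []
    if len(nid) < 2:
        problems.append('NID shorter than 2 characters')
    if len(nss) < 1:
        problems.append('NSS is empty')
    return problems

for iri in ('urn:abc:', 'urn:x:test', 'urn:isbn:0451450523'):
    print(iri, urn_problems(iri))
```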
== Password and users
The "user:password" form of "userinfo" is deprecated in RFC 3986, Section 3.2.1.
<cvs://:pserver:cvs:@cvs.cvsnt.org:/cvsnt>
<http://[email protected]/>
<http://[email protected]>
== Legal but ...
<http://www.example.com:/>
Legal!
<http://https://www.example.org/>
Legal!
scheme=http
host=https
path=//www.example.org/
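The breakdown above can be reproduced with the regular expression given in RFC 3986 Appendix B. This is a sketch; split_uri is a hypothetical helper. Note the authority comes out as "https:", i.e. host "https" followed by an empty port.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_PARTS = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def split_uri(uri: str) -> dict:
    """Split a URI into its five generic components (None if absent)."""
    m = URI_PARTS.match(uri)   # every part is optional, so this always matches
    return {
        'scheme': m.group(2),
        'authority': m.group(4),
        'path': m.group(5),
        'query': m.group(7),
        'fragment': m.group(9),
    }

parts = split_uri('http://https://www.example.org/')
print(parts['scheme'], parts['authority'], parts['path'])
```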
== Other
OK by RFC 3986, but illegal as DNS names:
<http://www.1=example.com/>
<http://www..example.com/>
<http://younggu-art.com%20>
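A rough host check for cases like these, assuming the usual letters-digits-hyphen rule for DNS labels (1 to 63 characters, no leading or trailing hyphen). It is a sketch only: it does not handle a trailing dot, IP addresses, or internationalized names, and is_dns_name is a hypothetical helper.

```python
import re

# One DNS label: letters, digits, hyphens; no hyphen at either end; max 63.
LABEL = re.compile(r'[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?')

def is_dns_name(host: str) -> bool:
    """True if every dot-separated label follows the LDH rule."""
    if not host or len(host) > 253:
        return False
    return all(LABEL.fullmatch(label) for label in host.split('.'))

for host in ('www.1=example.com', 'www..example.com',
             'younggu-art.com%20', 'www.example.com'):
    print(host, is_dns_name(host))
```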
== file:
There is now an RFC! RFC 8089
Despite what people may say, it was previously well defined ... but
(1) only absolute paths
(2) only ASCII characters.
RFC 8089 allows relative filenames.
It allows many more characters in the path including "~"
This is what Jena and other systems have done all along.
file://host/ is still legal. Jena grumbles because it can't be used.
file:/// is legal.