I've been looking at the compatibility of IRI handling, in error and warning cases. As part of that, I ran Jena 3.17.0 "riot --check" over wikidata.

The question is what to do about things that are not right, and how much to tolerate.

Should default behaviour be tightened up?
Which errors and warnings are actually useful?
Are there things that should be warnings that aren't?

(URI, IRI used interchangeably)


The worst cases are IRIs that are illegal by the grammar of RFC 3986, the RFC that defines the URI syntax. They can still be let through, with errors/warnings.

In 3.17.0, parsing is chatty but enforces little (i.e. makes it a parse error and stops) beyond what the language grammar requires.
The Turtle, N-Triples etc. parsers just look for strings without spaces:

IRIREF  ::=     '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
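
To show how little the token rule checks, here is a rough Python sketch of it as a regular expression. The names IRIREF and is_iriref_token are made up for this example; this is not how the Jena parsers are written.

```python
import re

# Turtle/N-Triples IRIREF token: '<', then any characters except
# U+0000-U+0020 and < > " { } | ^ ` \ , or \uXXXX / \UXXXXXXXX escapes, then '>'.
IRIREF = re.compile(
    r'<(?:[^\x00-\x20<>"{}|^`\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*>'
)

def is_iriref_token(text):
    """True if text is exactly one IRIREF token by the grammar rule."""
    return IRIREF.fullmatch(text) is not None
```

Strings such as <http://example/a#b#c> or <http://example.org/abc%RR/> pass this rule, so the token grammar alone says nothing about RFC 3986.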

Wikidata (5.4B triples) has shown that Jena 4.0.0 as it is today is too strict.
It also has a few IRIs that are RFC 3986 syntax errors:

code/count/violation name.

Code:0   29   ILLEGAL_CHARACTER
   - does not pass RFC3986
   e.g. two "#"s, or a port number that is not digits

Code:5    7   CONTROL_CHARACTER
    e.g. \u009, encoding errors.

Code:30   9   ILLEGAL_PERCENT_ENCODING
   e.g. %2% or %RR

which is really quite good.

There is use of deprecated (by RFC 3986) "user:password@"

In Wikidata the most common issue is the discouraged, but not illegal, use of ports

Code:12 127   PORT_SHOULD_NOT_BE_EMPTY
Code:13 412   DEFAULT_PORT_SHOULD_BE_OMITTED
Code:14 1019  PORT_SHOULD_NOT_BE_WELL_KNOWN
Code:15 0     PORT_SHOULD_NOT_START_IN_ZERO
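
A rough Python sketch of how a port string might be sorted into these four warnings. Everything here is an assumption for illustration: the function name port_warnings, the DEFAULT_PORTS table (http and https only), and reading "well known" as below 1024. It is not Jena's actual logic.

```python
# Assumption: only http/https defaults are considered.
DEFAULT_PORTS = {"http": 80, "https": 443}

def port_warnings(scheme, port):
    """port is the text after ':' in the authority; it may be empty."""
    if port == "":
        return ["PORT_SHOULD_NOT_BE_EMPTY"]
    if not (port.isascii() and port.isdigit()):
        # Not a warning: RFC 3986 says a port is digits only.
        return ["ILLEGAL_CHARACTER"]
    warnings = []
    if port.startswith("0"):
        warnings.append("PORT_SHOULD_NOT_START_IN_ZERO")
    number = int(port)
    if DEFAULT_PORTS.get(scheme.lower()) == number:
        warnings.append("DEFAULT_PORT_SHOULD_BE_OMITTED")
    elif number < 1024:
        # Assumption: "well known" means the IANA system port range 0-1023.
        warnings.append("PORT_SHOULD_NOT_BE_WELL_KNOWN")
    return warnings
```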

Some more information:
https://afs.github.io/rdf-iri-syntax.html

    Andy

Details, details, ...

== RFC3986 syntax errors

These are syntax errors by RFC 3986:
not just "bad practice", but syntax errors by the grammar.

Two fragments
<irc://freenode.net/##cwm>
<http://example/a#b#c>

(Two "?" is legal)
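
A small Python sketch of the check, using the component-splitting regular expression from RFC 3986, Appendix B. The fragment is everything after the first "#", so a second "#" shows up inside it. The names URI_PARTS and has_second_fragment are invented for this example.

```python
import re

# Component-splitting regex from RFC 3986, Appendix B.
# Groups: 2=scheme, 4=authority, 5=path, 7=query, 9=fragment.
URI_PARTS = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)

def has_second_fragment(uri):
    """True if a '#' appears inside the fragment, which RFC 3986 forbids."""
    fragment = URI_PARTS.match(uri).group(9)
    return fragment is not None and "#" in fragment
```

A second "?" lands inside the query (group 7), where it is allowed, so nothing is reported for it.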

Bad percent encoding; RFC 3986 says "2 hex digits".
<http://example.org/abc%2%F/>
<http://example.org/abc%RR/>
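
A minimal Python sketch of the "2 hex digits" rule: report every "%" that is not followed by two hex digits. BAD_PERCENT and bad_percent_offsets are names made up for this example.

```python
import re

# A '%' must be followed by exactly two hex digits (RFC 3986 pct-encoded).
BAD_PERCENT = re.compile(r'%(?![0-9A-Fa-f]{2})')

def bad_percent_offsets(uri):
    """Offsets of each '%' that does not start a valid percent-encoding."""
    return [m.start() for m in BAD_PERCENT.finditer(uri)]
```

In the first example both "%" characters are reported, because "%2%" and "%F/" are each one hex digit short.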

Ports:
<https://:www.example.org/>
":" starts the port, so there is no host and the port isn't digits.
It isn't a path either, because a path starts at "/".
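
To make the reading of the authority concrete, a Python sketch that splits it into userinfo, host and port and then checks the port. The helper names split_authority and port_is_legal are hypothetical; the splitting follows the RFC 3986 authority rule (optional userinfo and "@", host, optional ":" and port).

```python
def split_authority(authority):
    """Split an RFC 3986 authority into (userinfo, host, port).

    userinfo and port are None when absent; port is "" for a bare ':'.
    """
    userinfo, at, hostport = authority.rpartition("@")
    if not at:
        userinfo = None
    if hostport.startswith("[") and "]" in hostport:
        # IP literal such as [::1]; colons inside the brackets are not the port.
        end = hostport.index("]") + 1
        host, rest = hostport[:end], hostport[end:]
        port = rest[1:] if rest.startswith(":") else None
    else:
        host, colon, port = hostport.partition(":")
        if not colon:
            port = None
    return userinfo, host, port

def port_is_legal(port):
    """RFC 3986: a port is zero or more ASCII digits."""
    return port is None or port == "" or (port.isascii() and port.isdigit())
```

For the authority ":www.example.org" this gives an empty host and the port "www.example.org", which fails the digit check.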

<http://www.example/abc      def>
  Space
  In IRIs, other "whitespace" above U+00FF is legal.

== Legal by RFC 3986, but not by RFC 7230 (HTTP scheme)

Schemes can add further rules.

<https:///example.org>
<https:///path>
<https://?query>
<https://#frag>
  The host is missing
  Legal syntax by RFC 3986
  Illegal by RFC 7230 - the http URI scheme.
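
A Python sketch of a scheme-specific check layered on top of the general syntax: split with the RFC 3986 Appendix B regular expression, then require a non-empty host for http and https. The names URI_PARTS and http_host_missing are assumptions for this example.

```python
import re

# Component-splitting regex from RFC 3986, Appendix B.
# Groups: 2=scheme, 4=authority, 5=path, 7=query, 9=fragment.
URI_PARTS = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)

def http_host_missing(uri):
    """True for an http/https URI with no host; False for other schemes."""
    m = URI_PARTS.match(uri)
    scheme, authority = m.group(2), m.group(4)
    if scheme is None or scheme.lower() not in ("http", "https"):
        return False
    if authority is None:
        return True  # no "//" at all, e.g. "http:path"
    hostport = authority.rpartition("@")[2]   # drop any userinfo
    host = hostport.partition(":")[0]         # only tested for emptiness
    return host == ""
```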

==  Legal by RFC 3986 syntax, but not by RFC 8141, latest URN RFC

RFC 8141 makes some things illegal which were legal.

A URN is "urn:NID:NSS".
NID must be 2+ characters, not starting with a digit.
NSS must be 1+ characters

These are legal by RFC 3986 (general URIs), but not by RFC 8141 (URNs):

<urn:abc:>
<urn:x:test>

That last one may catch people out in test data.
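
A Python sketch that checks only the two length rules given above (NID of 2+ characters, NSS of 1+ characters). It is nowhere near the full RFC 8141 grammar, and the function name urn_lengths_ok is invented for this example.

```python
def urn_lengths_ok(urn):
    """Check the "urn:NID:NSS" shape and the NID/NSS minimum lengths only."""
    parts = urn.split(":", 2)  # NSS may itself contain ':'
    if len(parts) != 3 or parts[0].lower() != "urn":
        return False
    nid, nss = parts[1], parts[2]
    return len(nid) >= 2 and len(nss) >= 1
```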

== Password and users

The "user:password" form of "userinfo" is deprecated in RFC 3986, Section 3.2.1
<cvs://:pserver:cvs:@cvs.cvsnt.org:/cvsnt>
<http://[email protected]/>
<http://[email protected]>

== Legal but ...

<http://www.example.com:/>
  Legal!

<http://https://www.example.org/>
  Legal!
  scheme=http
  host=https
  path=//www.example.org/
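
The split can be reproduced with the regular expression from RFC 3986, Appendix B. A Python sketch follows; split_uri is a name made up for this example.

```python
import re

# Component-splitting regex from RFC 3986, Appendix B.
URI_PARTS = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)

def split_uri(uri):
    """Return the five RFC 3986 components; absent ones are None."""
    m = URI_PARTS.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }
```

For <http://https://www.example.org/> the authority comes out as "https:", that is, host "https" with an empty port, and the rest is the path.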

== Other

OK by RFC 3986,
but not legal DNS names:
<http://www.1=example.com/>
<http://www..example.com/>
<http://younggu-art.com%20>
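
A Python sketch of a host name check using the usual letters-digits-hyphen rule for DNS labels (1 to 63 characters, no hyphen at either end). LABEL and bad_dns_labels are hypothetical names; this is an illustration, not what Jena does.

```python
import re

# One DNS label: letters, digits, hyphen; 1-63 chars; no leading/trailing hyphen.
LABEL = re.compile(r'[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?')

def bad_dns_labels(host):
    """Labels of host that break the letters-digits-hyphen rule.

    Note: a trailing dot (fully qualified form) is also reported as an
    empty label by this simple version.
    """
    return [label for label in host.split(".") if LABEL.fullmatch(label) is None]
```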

== file:

There is now an RFC! RFC 8089
Despite what people may say, it was previously well defined ... but
(1) only absolute paths
(2) only ASCII characters.

RFC 8089 allows relative filenames.
It allows many more characters in the path including "~"

This is what Jena and other systems have done all along.
file://host/ is still legal. Jena grumbles because it can't be used.
file:/// is legal.
