jena-iri is difficult to maintain as Arne recently discovered. There is
only so much tweaking that can be done to it. It was written to be
standalone, general and modular. It covers several URI schemes that
don't arise for RDF/linked data/knowledge graphs. It is out of date with
respect to the more recent RFC revisions.
-- iri4ld
https://github.com/afs/x4ld/blob/main/iri4ld/README.md
iri4ld is an attempt at an IRI checker/parser that is more maintainable
and focused on linked data needs. It supports RFC 3986/3987 syntax
parsing (i.e. UTF-8, not restricted to ASCII) in a single pass with a
single object allocation; there is no regular expression usage in syntax
parsing. It has no runtime dependencies.
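To illustrate what "single pass, no regular expressions" means in practice, here is a minimal sketch of scanning just the scheme part of an IRI by hand, following the RFC 3986 rule scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ). The class and method names are made up for this example; this is not the iri4ld code, which handles the whole IRI.

```java
// Hypothetical sketch; not the iri4ld implementation.
final class SchemeScan {
    /** Returns the scheme (without the ':') or null if there is no valid scheme. */
    static String scheme(String iri) {
        for (int i = 0; i < iri.length(); i++) {
            char ch = iri.charAt(i);
            if (ch == ':')
                // A scheme must have at least one character.
                return i == 0 ? null : iri.substring(0, i);
            boolean alpha = (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
            boolean other = (ch >= '0' && ch <= '9') || ch == '+' || ch == '-' || ch == '.';
            // First character must be ALPHA; later ones may also be DIGIT, '+', '-', '.'.
            if (!(alpha || (i > 0 && other)))
                return null;
        }
        return null; // No ':' seen: a relative reference, so no scheme.
    }

    public static void main(String[] args) {
        System.out.println(scheme("urn:uuid:1234"));
        System.out.println(scheme("/path:with-colon"));
    }
}
```

Each character is looked at once and the only allocation is the returned substring.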
It also provides URI scheme validation for the latest RFCs for:
http:, https:, did:, file:,
urn:, urn:uuid:, urn:oid:, example:
and non-standard schemes
uuid: and oid:
More can be added as a code change.
Unknown URI schemes are not rejected - they get the syntax parsing but
no further validation.
If anyone works with data with other URI schemes, please email with a
link to the definition.
-- iri4ld in Jena
Jena has a comprehensive IRIx test suite that covers the expectations.
It captures user feedback based on what users encounter.
IRIProvider is the Jena adapter interface for an IRI library
implementation [1]. Like jena-iri, error and warning handling of iri4ld
can be configured. A provider can tune and adapt the exposed behaviour
of the IRI library.
In Jena, all scheme-specific violations become RIOT warnings (as
before). Only syntax errors for IRIs are RIOT parser errors, because
such IRIs cause problems when writing legal RDF data.
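The rule above can be sketched as a small mapping from the kind of violation to a severity. All names here (SeverityPolicy, ViolationKind, Severity) are hypothetical and only illustrate the policy; they are not the IRIProvider or RIOT API.

```java
// Hypothetical sketch of the policy described above; not Jena's API.
final class SeverityPolicy {
    enum ViolationKind { SYNTAX, SCHEME_SPECIFIC }
    enum Severity { ERROR, WARNING }

    /** Syntax violations are errors; scheme-specific violations are warnings. */
    static Severity severityOf(ViolationKind kind) {
        return kind == ViolationKind.SYNTAX ? Severity.ERROR : Severity.WARNING;
    }

    public static void main(String[] args) {
        for (ViolationKind kind : ViolationKind.values())
            System.out.println(kind + " -> " + severityOf(kind));
    }
}
```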
iri4ld has different text for messages (there is a way to use jena-iri
messages but IMO the jena-iri messages are a bit cryptic). This is the
biggest change.
-- Current status
iri4ld passes the Jena test suite, except that it has thrown up one area
to investigate: should non-ASCII UTF-8 characters in URNs be allowed? [2]
There will be further checking to do - this is a change to something
that underpins a lot of the system.
The change should happen at the beginning of a development cycle.
-- What to do with jena-iri?
iri4ld uses jena-iri in its test suite for comparison (with a lot of
noted exceptions).
jena-iri could be moved, and the package and module name reused for
iri4ld; or we could have a new jena-<somename> and leave jena-iri in
place; or we could retire jena-iri (then the new IRI code would not test
against it, but by now the iri4ld test suite and the Jena IRIx test
suites should be covering everything).
Andy
[1] The JDK URI parsing in java.net.URI is very out of date. It is RFC
2396, with some tweaks. There is already an IRIProviderJDK for
java.net.URI. The Jena test suite shows it is a long way off.
[2] Background: URNs are strictly ASCII. The key question for linked
data usage is whether to allow UTF-8 beyond ASCII in URNs in general.
There are arguments for and against.
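For concreteness, the case in question [2] can be detected with a check like the following. This is a minimal sketch with made-up names (UrnAsciiCheck, isUrnWithNonAscii); whether such an IRI should then be an error, a warning or accepted is exactly the open question, and the sketch does not decide it.

```java
// Hypothetical sketch; only detects the case, it does not set policy.
final class UrnAsciiCheck {
    /** True if the string starts with "urn:" (any case) and has a character beyond ASCII. */
    static boolean isUrnWithNonAscii(String iri) {
        if (!iri.regionMatches(true, 0, "urn:", 0, 4))
            return false;
        for (int i = 4; i < iri.length(); i++) {
            if (iri.charAt(i) > 0x7F)
                return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isUrnWithNonAscii("urn:example:caf\u00e9"));
        System.out.println(isUrnWithNonAscii("urn:example:cafe"));
    }
}
```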