On 26/10/2024 21:32, Andy Seaborne wrote:
Issue
   https://github.com/apache/jena/issues/2800
and PR
   https://github.com/apache/jena/pull/2801

The new code is faster when parsing using jena-iri3986.

Test: parsing BSBM 25m from an "nt.gz" file on an SSD. Larger BSBM files parse only very slightly faster.

On my machine, with Java 21:
   Jena 5.2.0 is about 500 kTPS
   PR/jena-iri is about 505 kTPS, consistently a little faster than 5.2.0
   PR/jena-iri3986 is 620 kTPS (22% faster)

PR/jena-iri:      545-550 kTPS
PR/jena-iri3986:  685-690 kTPS

This is picking up the recent main updates and also changing to use CacheSimple.

    Andy


Instructions for trying it out are on the PR - the default is the existing jena-iri IRI provider.

     Andy

On 29/09/2024 19:06, Andy Seaborne wrote:

jena-iri is difficult to maintain as Arne recently discovered. There is only so much tweaking that can be done to it. It was written to be standalone, general and modular. It covers several URI schemes that don't arise for RDF/linked data/knowledge graphs. It is out-of-date with more recent RFC revisions.

-- iri4ld

https://github.com/afs/x4ld/blob/main/iri4ld/README.md

iri4ld is an attempt at an IRI checker/parser that is more maintainable and focused on linked data needs. It supports syntax parsing for RFC 3986/3987 (i.e. UTF-8, not restricted to ASCII) as a single pass with a single object allocation; there is no regular expression usage in syntax parsing. It has no runtime dependencies.
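
To illustrate what regex-free splitting can look like, here is a minimal Java sketch. It is not iri4ld's code: the class name IriSplitSketch and its split method are made up for this example. It only separates the five RFC 3986 components with a linear scan; it does not check the characters inside each component, and it makes no attempt at the single-allocation property.

```java
/** Hypothetical sketch, not iri4ld's code: split an IRI string into the
 *  five RFC 3986 components with a linear scan and no regular expressions. */
final class IriSplitSketch {
    // Positions in the returned array; a null entry means "component absent".
    static final int SCHEME = 0, AUTHORITY = 1, PATH = 2, QUERY = 3, FRAGMENT = 4;

    static String[] split(String s) {
        String[] parts = new String[5];
        int n = s.length();
        int i = 0;

        // Scheme: non-empty text before the first ':' provided
        // no '/', '?' or '#' comes before that ':'.
        for (int j = 0; j < n; j++) {
            char c = s.charAt(j);
            if (c == ':') {
                if (j > 0) { parts[SCHEME] = s.substring(0, j); i = j + 1; }
                break;
            }
            if (c == '/' || c == '?' || c == '#') break;
        }

        // Authority: only present after "//"; runs to the next '/', '?' or '#'.
        if (i + 1 < n && s.charAt(i) == '/' && s.charAt(i + 1) == '/') {
            int start = i + 2;
            i = start;
            while (i < n && "/?#".indexOf(s.charAt(i)) < 0) i++;
            parts[AUTHORITY] = s.substring(start, i);
        }

        // Path: always present (possibly empty); runs to '?' or '#'.
        int pathStart = i;
        while (i < n && s.charAt(i) != '?' && s.charAt(i) != '#') i++;
        parts[PATH] = s.substring(pathStart, i);

        // Query: after '?', up to '#'.
        if (i < n && s.charAt(i) == '?') {
            int start = ++i;
            while (i < n && s.charAt(i) != '#') i++;
            parts[QUERY] = s.substring(start, i);
        }

        // Fragment: everything after '#'.
        if (i < n && s.charAt(i) == '#') parts[FRAGMENT] = s.substring(i + 1);
        return parts;
    }
}
```

For example, IriSplitSketch.split("urn:uuid:1234") gives scheme "urn" and path "uuid:1234", with no authority, query or fragment. Non-ASCII characters pass through untouched because the scan only looks for the ASCII delimiters.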

It also provides URI scheme validation for the latest RFCs for:

     http:, https:, did:, file:,
     urn:, urn:uuid:, urn:oid:, example:
and non-standard schemes
     uuid: and oid:

More can be added as a code change.

Unknown URI schemes are not rejected - they get the syntax parsing but no further validation.
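
The "known schemes get extra checks, unknown schemes pass" behaviour can be sketched roughly as below. This is an assumption about the general shape, not iri4ld's API: SchemeCheckSketch, validate, checkHttp and checkUrn are hypothetical names, and the two checks are deliberately simplistic stand-ins for the real per-RFC rules.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch, not iri4ld's API: dispatch on the scheme name and
 *  return warning messages; an empty list means nothing to report. */
final class SchemeCheckSketch {

    static List<String> validate(String iri) {
        int colon = iri.indexOf(':');
        if (colon <= 0) return List.of();            // no scheme to check
        String scheme = iri.substring(0, colon).toLowerCase(Locale.ROOT);
        switch (scheme) {
            case "http":
            case "https":
                return checkHttp(iri, colon);
            case "urn":
                return checkUrn(iri, colon);
            default:
                return List.of();                    // unknown scheme: syntax parsing only
        }
    }

    // Simplified stand-in: require "//" followed by a non-empty authority.
    private static List<String> checkHttp(String iri, int colon) {
        String rest = iri.substring(colon + 1);
        if (!rest.startsWith("//")) return List.of("http(s): missing '//'");
        int end = 2;
        while (end < rest.length() && "/?#".indexOf(rest.charAt(end)) < 0) end++;
        if (end == 2) return List.of("http(s): empty authority");
        return List.of();
    }

    // Simplified stand-in: require the shape urn:NID:NSS with both parts non-empty.
    private static List<String> checkUrn(String iri, int colon) {
        int second = iri.indexOf(':', colon + 1);
        if (second < 0 || second == colon + 1 || second == iri.length() - 1)
            return List.of("urn: expected urn:NID:NSS");
        return List.of();
    }
}
```

Adding a scheme here means adding a case, which matches "more can be added as a code change"; a lookup table would work equally well.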

If anyone works with data with other URI schemes, please email with a link to the definition.

-- iri4ld in Jena

Jena has a comprehensive IRIx test suite that covers the expectations. It captures user feedback based on what users encounter.

IRIProvider is the Jena adapter interface for an IRI library implementation [1]. As with jena-iri, the error and warning handling of iri4ld can be configured. A provider can tune and adapt the exposed behaviour of the IRI library.

In Jena, all scheme specific violations become RIOT warnings (as before). Only syntax errors for IRIs are RIOT parser errors - these cause problems when writing legal RDF data.

iri4ld has different text for messages (there is a way to use jena-iri messages but IMO the jena-iri messages are a bit cryptic). This is the biggest change.

-- Current status

iri4ld passes the Jena test suite, except that it has thrown up one area to investigate: should UTF-8 characters in URNs be allowed? [2]

There will be further checking to do - this is a change to something that underpins a lot of the system.

Changing over should happen at the beginning of a development cycle.

-- What to do with jena-iri?

iri4ld uses jena-iri in its test suite to compare (with a lot of noted exceptions).

jena-iri could be moved, and the package and module name reused for iri4ld; or we could have a new jena-<somename> and leave jena-iri in place; or we could retire jena-iri. In the last case the new IRI code would not test against it, but by now the iri4ld test suite and the Jena IRIx test suites should be covering everything.

     Andy

[1] The JDK URI parsing of java.net.URI is very out of date. It is RFC 2396, with some tweaks. There is already an IRIProviderJDK for java.net.URI. The Jena test suite shows it is a long way off.

[2] Background: URNs are strictly ASCII. The key question for linked data usage is whether to allow UTF-8 beyond ASCII in URNs in general. There are arguments for and against.
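
One small illustration of footnote [1], as a sketch rather than a full comparison: java.net.URI keeps the RFC 2396 notion of an "opaque" URI, so it reports no path for a URN, whereas the RFC 3986 generic syntax gives such a string the path "isbn:0451450523". The class name JdkUriSketch and the sample URN are made up for the example; only documented java.net.URI methods are called.

```java
import java.net.URI;

/** Hypothetical sketch: what java.net.URI (RFC 2396 model) reports for a string. */
final class JdkUriSketch {

    /** Path as the JDK sees it; null when the JDK classifies the URI as opaque. */
    static String jdkPath(String s) {
        return URI.create(s).getPath();
    }

    /** True when the JDK treats the URI as opaque (absolute, and the
     *  scheme-specific part does not start with '/'). */
    static boolean jdkOpaque(String s) {
        return URI.create(s).isOpaque();
    }

    /** Everything after "scheme:" as the JDK reports it. */
    static String jdkSchemeSpecificPart(String s) {
        return URI.create(s).getSchemeSpecificPart();
    }
}
```

With "urn:isbn:0451450523", jdkOpaque returns true and jdkPath returns null; the text after the scheme is only available through jdkSchemeSpecificPart. For "http://example.org/a" the JDK reports the path "/a" as expected.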


