Fast RDF/XML parsing - woodstox, aalto and alternatives

Arne Bernhardt Mon, 23 Sep 2024 08:28:51 -0700

Hello,
I have been trying to speed up the parsing of RDF/XML in Apache Jena.


The reason is that our customers are introducing new data formats. This
involves replacing small (<1MB) ASCII files originating from punch cards
with large (>40MB) RDF/XML formats. However, the expectation is that the
processing time will not increase. ;-)

The first thing I tried was to integrate parsers other than Xerces.
But once I added the dependencies, the unchanged riot/lang/rdfxml
implementations also started to use either woodstox or aalto.
The reason lies in org.apache.jena.riot.lang.rdfxml.SysRRX, which calls
"XMLInputFactory.newInstance();". To continue my comparisons, I used
"XMLInputFactory.newDefaultFactory()" instead.
--> So, when I bundle Jena with another application that has aalto as a
dependency, aalto may be used?
If this is potentially dangerous, as mentioned in
https://issues.apache.org/jira/browse/JENA-2331, shouldn't we use
"XMLInputFactory.newDefaultFactory()" to ensure that no unknown 3rd-party
parser is used?
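
To make the difference between the two calls concrete, here is a minimal
sketch that prints which implementation class each one returns. The class
name StaxFactoryCheck and its two helper methods are made up for this
illustration and are not part of Jena. It assumes Java 9 or later, since
newDefaultFactory() does not exist before that.

```java
import javax.xml.stream.XMLInputFactory;

final class StaxFactoryCheck {

    /**
     * Implementation picked by the JAXP lookup: a system property or
     * configuration file first, then any provider found on the classpath
     * via the service-provider mechanism, then the JDK built-in one.
     */
    static String discoveredFactory() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    /** Implementation built into the JDK, regardless of the classpath. */
    static String defaultFactory() {
        return XMLInputFactory.newDefaultFactory().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("newInstance():       " + discoveredFactory());
        System.out.println("newDefaultFactory(): " + defaultFactory());
    }
}
```

On a plain JDK classpath both lines should name the same class. With
woodstox or aalto on the classpath, I would expect only the first line to
change, which matches what I observed with the riot/lang/rdfxml parsers.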

In my benchmarks, woodstox is a bit faster than the default parser, but
with larger files aalto is faster still. Unfortunately, the aalto project
seems to be almost inactive.
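
For anyone who wants to reproduce this kind of comparison, the following is
a rough sketch of a timing loop over the raw StAX event stream. It is not
the benchmark I ran: the class StaxTiming, the synthetic document, and the
run count are assumptions chosen for illustration, and it measures only XML
tokenizing, not Jena's triple generation. A serious measurement would use a
harness such as JMH.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxTiming {

    /** Builds a synthetic RDF/XML document with n rdf:Description elements. */
    static byte[] syntheticRdfXml(int n) {
        StringBuilder sb = new StringBuilder();
        sb.append("<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"")
          .append(" xmlns:ex=\"http://example.org/\">");
        for (int i = 0; i < n; i++) {
            sb.append("<rdf:Description rdf:about=\"http://example.org/s").append(i).append("\">")
              .append("<ex:p>value ").append(i).append("</ex:p>")
              .append("</rdf:Description>");
        }
        sb.append("</rdf:RDF>");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Pulls every event from the stream and returns the number of start elements. */
    static long countStartElements(XMLInputFactory factory, byte[] xml) {
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new ByteArrayInputStream(xml));
            long count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML parse failed", e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = syntheticRdfXml(200_000);
        // Swap in another factory here to compare implementations.
        XMLInputFactory factory = XMLInputFactory.newDefaultFactory();
        for (int run = 0; run < 5; run++) { // the first runs mostly warm up the JIT
            long start = System.nanoTime();
            long elements = countStartElements(factory, xml);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(factory.getClass().getName() + ": "
                    + elements + " start elements in " + ms + " ms");
        }
    }
}
```

To compare implementations, the factory line can be replaced with a direct
instantiation, e.g. "new com.ctc.wstx.stax.WstxInputFactory()" for woodstox
or "new com.fasterxml.aalto.stax.InputFactoryImpl()" for aalto, assuming the
respective jar is on the classpath.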

Do you have any suggestions for fast XML parsers, or is there any work
under way on a faster RDF/XML parser?

  Arne
