On 23/09/2024 16:28, Arne Bernhardt wrote:
Hello,
I have been trying to speed up the parsing of RDF/XML in Apache Jena.

The reason is that our customers are introducing new data formats. This
involves replacing small (<1MB) ASCII files originating from punch cards
with large (>40MB) RDF/XML formats. However, the expectation is that the
processing time will not increase. ;-)

Can you provide a copy of the test data?

Performance can be sensitive to XML shape.

Of the RRX parsers, StAX/EV is the slowest. It was written first, and it is faster than ARP, but not by as much as I'd hoped.

StAX/SR is a fairly simple conversion of the event code to use StreamReader, and it is faster than EV.
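
To illustrate the difference between the two StAX styles, here is a minimal sketch using only the JDK's javax.xml.stream API. It is not the RRX code; the class name StaxStyles and the sample document are made up for the example, and it assumes Java 9+ for newDefaultFactory(). The event style allocates an XMLEvent object per event, while the stream (cursor) style reads state directly from the reader, which is one plausible reason for the speed difference.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;

// Hypothetical illustration class; not part of Jena.
final class StaxStyles {
    // Small made-up RDF/XML document with three elements.
    static final String SAMPLE =
        "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\""
        + " xmlns:ex=\"http://example/\">"
        + "<rdf:Description rdf:about=\"http://example/s\">"
        + "<ex:p>o</ex:p>"
        + "</rdf:Description>"
        + "</rdf:RDF>";

    // Cursor style ("SR"): next() returns an int event code, no event objects.
    static int countWithStreamReader(XMLInputFactory factory, String xml)
            throws XMLStreamException {
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                count++;
            }
        }
        reader.close();
        return count;
    }

    // Event style ("EV"): nextEvent() returns an XMLEvent object per event.
    static int countWithEventReader(XMLInputFactory factory, String xml)
            throws XMLStreamException {
        XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
        int count = 0;
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                count++;
            }
        }
        reader.close();
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newDefaultFactory();
        System.out.println("stream reader: " + countWithStreamReader(factory, SAMPLE));
        System.out.println("event reader:  " + countWithEventReader(factory, SAMPLE));
    }
}
```

Both methods see the same elements; only the access style differs.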

SAX is fastest sometimes, and sometimes it's StAX/SR.

TBH, once it was much faster than ARP, I didn't push on performance as much.


For testing I was using RDF from UniProt - citations.rdf has many long strings (literals of several lines of text). 3.6G uncompressed. 30e6 triples.

https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/citations.rdf.xz

Figures as of yesterday:
Files on an SSD.
Running off the development tree (getting class files from jars makes a measurable difference).

riot --syntax arp1 --time --count --sink citations.rdf
citations.rdf   : Triples = 30,358,427
citations.rdf   : 61.17 sec : 30,358,427 Triples : 496,287.90 per second

riot --syntax rrxstaxev --time --count --sink citations.rdf
citations.rdf   : Triples = 30,358,427
citations.rdf   : 52.22 sec : 30,358,427 Triples : 581,411.99 per second

riot --syntax rrxstaxsr --time --count --sink citations.rdf
citations.rdf   : Triples = 30,358,427
citations.rdf   : 31.86 sec : 30,358,427 Triples : 952,929.47 per second

riot --syntax rrxsax --time --count --sink citations.rdf
citations.rdf   : Triples = 30,358,427
citations.rdf   : 32.32 sec : 30,358,427 Triples : 939,220.59 per second

The first thing I tried was to integrate other parsers than Xerces.
But when I added the dependencies, the unchanged riot/lang/rdfxml
implementations also started to use either woodstox or aalto.
The reason was in org.apache.jena.riot.lang.rdfxml.SysRRX, where
"XMLInputFactory.newInstance();" is used. To continue my comparisons, I used
"XMLInputFactory.newDefaultFactory()" instead.
--> So, when I bundle Jena with another application where aalto is a
dependency, aalto may be used?

As I understand it, "XMLInputFactory.newInstance()" returns an instance from the system-configured parser and "XMLInputFactory.newDefaultFactory()" is always the built-in XMLInputFactoryImpl.
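
A quick way to check which implementation each call picks in a given application is to print the class names. This is a minimal sketch using only JDK calls (newDefaultFactory() needs Java 9+); the class name FactoryLookup is made up for the example. With woodstox or aalto on the classpath, the first line would be expected to show their factory class, and the second the JDK's built-in one.

```java
import javax.xml.stream.XMLInputFactory;

// Hypothetical illustration class; not part of Jena.
final class FactoryLookup {
    // Uses the JAXP lookup, so a third-party StAX implementation
    // found on the classpath may be returned.
    static String lookedUp() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    // Bypasses the lookup and returns the JDK's built-in implementation.
    static String builtIn() {
        return XMLInputFactory.newDefaultFactory().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("newInstance():       " + lookedUp());
        System.out.println("newDefaultFactory(): " + builtIn());
    }
}
```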

Parsers are configured to be safe via the setup in JenaXMLInput.

Yes, other parsers can be used.

Should Jena always use the built-in one?
SysRRX could have a function that passes in the factory generator to allow an independent choice.

That would allow Jena to provide RDF/XML with one parser yet allow the app to have XML processing elsewhere with a different choice of XML parser.
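
Roughly what I have in mind, as a sketch only: nothing like this exists in SysRRX today, and the names XmlFactoryChoice, setGenerator and createFactory are hypothetical. The generator defaults to the built-in factory (assuming that would be the chosen default), and an application can swap in its own.

```java
import java.util.Objects;
import java.util.function.Supplier;
import javax.xml.stream.XMLInputFactory;

// Hypothetical sketch; not an existing Jena class or API.
final class XmlFactoryChoice {
    // Assumed default: the JDK built-in factory, independent of the classpath.
    private static volatile Supplier<XMLInputFactory> generator =
        () -> XMLInputFactory.newDefaultFactory();

    private XmlFactoryChoice() {}

    // Lets the application choose the StAX implementation used for RDF/XML.
    static void setGenerator(Supplier<XMLInputFactory> newGenerator) {
        generator = Objects.requireNonNull(newGenerator, "newGenerator");
    }

    // Called wherever a new XMLInputFactory is needed.
    static XMLInputFactory createFactory() {
        return generator.get();
    }

    public static void main(String[] args) {
        System.out.println("default: " + createFactory().getClass().getName());
        // Opt in to the JAXP lookup, which may pick up woodstox or aalto.
        setGenerator(() -> XMLInputFactory.newInstance());
        System.out.println("lookup:  " + createFactory().getClass().getName());
    }
}
```

The rest of the application can keep calling XMLInputFactory.newInstance() for its own XML processing without affecting the RDF/XML parser.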

If this is potentially dangerous, as mentioned in
https://issues.apache.org/jira/browse/JENA-2331, shouldn't we use
"XMLInputFactory.newDefaultFactory()" to ensure that no unknown 3rd
party parser is used?

JENA-2331 is fixed, I hope.
The setup should work with woodstox.
(the code in JenaXMLInput.initXMLInputFactory pokes around and should not set properties that woodstox does not support)
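
The general idea of "poking around" can be sketched like this. It is not the actual JenaXMLInput code; the class SafeXmlSetup, its method names, and the choice of properties are assumptions for illustration. Each property is only set if the factory reports that it supports it, so an implementation that lacks a property is left alone instead of throwing.

```java
import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;

// Hypothetical sketch; not the code in JenaXMLInput.
final class SafeXmlSetup {
    // Sets a property only when the factory says it is supported.
    // Returns true if the property was set.
    static boolean setIfSupported(XMLInputFactory factory, String name, Object value) {
        if (!factory.isPropertySupported(name)) {
            return false;
        }
        try {
            factory.setProperty(name, value);
            return true;
        } catch (IllegalArgumentException ex) {
            // Reported as supported, but the value was rejected.
            return false;
        }
    }

    // Assumed choice of settings: block external entities and external DTD access.
    static XMLInputFactory harden(XMLInputFactory factory) {
        setIfSupported(factory, XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        setIfSupported(factory, XMLConstants.ACCESS_EXTERNAL_DTD, "");
        return factory;
    }

    public static void main(String[] args) {
        XMLInputFactory factory = harden(XMLInputFactory.newDefaultFactory());
        System.out.println("external entities: "
            + factory.getProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES));
    }
}
```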

Is there still a problem?

In my benchmarks, woodstox is a bit faster than the default, but with
larger files aalto is even faster. Unfortunately aalto seems to be almost
inactive.

What speeds were you getting?

Do you have any suggestions on fast xml parsers or is there any work on a
faster rdf/xml parser?

   Arne

