Hi Andy,

first, the tiny fixes to support Aalto and Woodstox:

[image: image.png]

Aalto has a bug where it initializes the default namespace prefix with `null` rather than `""` (https://github.com/FasterXML/aalto-xml/blob/master/src/main/java/com/fasterxml/aalto/in/NsBinding.java#L59), but `getNamespaceURI(String prefix)` requires the prefix to be non-null (https://github.com/FasterXML/aalto-xml/blob/master/src/main/java/com/fasterxml/aalto/in/FixedNsContext.java#L102). So I used a dirty workaround.
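The actual change is in the screenshot; in essence, the workaround just maps a null prefix to the empty string before the namespace lookup. A minimal sketch of the idea (illustrative only - the helper name and the exact place in the RRX code where this happens are my own shorthand, not the real patch):

    import javax.xml.XMLConstants;
    import javax.xml.namespace.NamespaceContext;

    // Sketch: Aalto can hand out null as the default-namespace prefix, but
    // NamespaceContext.getNamespaceURI(prefix) rejects null, so map it to "" first.
    static String safeNamespaceURI(NamespaceContext ctx, String prefix) {
        if (prefix == null)
            prefix = XMLConstants.DEFAULT_NS_PREFIX;   // i.e. ""
        return ctx.getNamespaceURI(prefix);
    }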
I extended initXMLInputFactory to get rid of the warnings for Aalto:

[image: image.png]

I also tried to use the SAX parser implementation with Woodstox and Aalto, but a lot of tests were not running, and fixing that did not look as easy as it was for the StAX parser. So I stopped working on that.

As test data, I use the ENTSO-E conformity assessment test data for the current CGMES version 2.4.15, downloadable as "TestConfigurations_packageCASv2.0.zip" from https://www.entsoe.eu/data/cim/cim-conformity-and-interoperability/. (I also found it here: https://github.com/derrickoswald/CIMSpark/blob/master/CIMReader/data/CGMES_v2.4.15_TestConfigurations_v4.0.3.zip)

The size of the "RealGrid" example is close to what we see for a larger electricity grid in Europe, as long as the operators only export the simpler bus-branch models. We have 35 such grids in Europe, each operated by one transmission system operator (TSO). In the near future, the TSOs will exchange much more detailed node-breaker models in order to keep the grid stable. Data from all 35 grids is required to calculate congestion and redispatch. The "RealGrid" example only represents one timestamp, and we normally have 24, each representing one hour. So we need to read approximately "RealGrid" x 35 TSOs x 24 hours.

Since we have many files to process and long-running processes, I do not care about single-shot performance. What matters to me is the performance after warm-up, which is convenient to benchmark with JMH. To prevent the optimizer from interfering with the benchmark, I read into a GraphMem (a sketch of the benchmark method is shown a bit further below). In the tables, RRX.RDFXML_StAX2_sr is Woodstox.

Benchmark                                       (param0_GraphUri)                                    (param1_ParserLang)        Mode  Cnt  Score   Error  Units
TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_SAX              avgt   30  1,348 ± 0,008   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX_ev          avgt   30  1,736 ± 0,009   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX_sr          avgt   30  1,350 ± 0,011   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX2_sr         avgt   30  1,266 ± 0,020   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX2_sr_aalto   avgt   30  1,142 ± 0,009   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_SAX              avgt   30  0,194 ± 0,003   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX_ev          avgt   30  0,229 ± 0,003   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX_sr          avgt   30  0,188 ± 0,003   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX2_sr         avgt   30  0,182 ± 0,003   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX2_sr_aalto   avgt   30  0,170 ± 0,003   s/op

We may process the different profiles (especially EQ, SSH and TP) in parallel, but then the largest one becomes our bottleneck. We read each profile into a separate graph. The "EQ" profiles (equipment profile, with "_EQ_" in the file name) are especially large compared to the other profiles.
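For reference, the JMH benchmark method is essentially the sketch below. It is simplified: the base IRI, the use of GraphFactory.createDefaultGraph() as the GraphMem, and the langFor(...) helper that maps the parameter string to a Lang are my shorthand here, not necessarily the exact code I run (warm-up and measurement annotations are omitted).

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.jena.graph.Graph;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFParser;
    import org.apache.jena.riot.lang.rdfxml.RRX;
    import org.apache.jena.sparql.graph.GraphFactory;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Param;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Benchmark)
    public class TestXMLParser {

        @Param({"CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml"})   // values trimmed
        public String param0_GraphUri;

        @Param({"RRX.RDFXML_SAX", "RRX.RDFXML_StAX_ev", "RRX.RDFXML_StAX_sr"})
        public String param1_ParserLang;

        @Benchmark
        public Graph parseXMLUsingBufferedInputStream() throws IOException {
            // Parse into an in-memory GraphMem so the JIT cannot optimize the parsing away.
            Graph graph = GraphFactory.createDefaultGraph();
            Lang lang = langFor(param1_ParserLang);
            try (InputStream in = new BufferedInputStream(new FileInputStream(param0_GraphUri))) {
                RDFParser.create()
                         .source(in)
                         .lang(lang)
                         .base("http://example.org/")   // CGMES uses rdf:ID, i.e. relative IRIs, so some base is needed
                         .parse(graph);
            }
            return graph;   // returning the result also keeps the benchmark honest
        }

        // Helper: map the JMH parameter string to the corresponding RRX Lang.
        private static Lang langFor(String name) {
            switch (name) {
                case "RRX.RDFXML_SAX":     return RRX.RDFXML_SAX;
                case "RRX.RDFXML_StAX_ev": return RRX.RDFXML_StAX_ev;
                case "RRX.RDFXML_StAX_sr": return RRX.RDFXML_StAX_sr;
                default: throw new IllegalArgumentException(name);
            }
        }
    }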
So I focus especially on the EQ profile. Here are benchmarks with citations.rdf and bsbm-5m (converted to RDF/XML):

Benchmark                                       (param0_GraphUri)  (param1_ParserLang)        Mode  Cnt   Score    Error  Units
TestXMLParser.parseXMLUsingBufferedInputStream  bsbm-5m.xml        RRX.RDFXML_SAX              avgt    3  13,508 ±  6,208   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  bsbm-5m.xml        RRX.RDFXML_StAX_ev          avgt    3  16,123 ± 18,367   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  bsbm-5m.xml        RRX.RDFXML_StAX_sr          avgt    3  13,939 ±  2,823   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  bsbm-5m.xml        RRX.RDFXML_StAX2_sr         avgt    3  13,048 ±  2,160   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  bsbm-5m.xml        RRX.RDFXML_StAX2_sr_aalto   avgt    3  11,934 ±  2,509   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  citations.rdf      RRX.RDFXML_SAX              avgt    3  60,623 ± 47,619   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  citations.rdf      RRX.RDFXML_StAX_ev          avgt    3  94,292 ± 77,405   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  citations.rdf      RRX.RDFXML_StAX_sr          avgt    3  70,636 ± 48,869   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  citations.rdf      RRX.RDFXML_StAX2_sr         avgt    3  69,441 ± 12,561   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  citations.rdf      RRX.RDFXML_StAX2_sr_aalto   avgt    3  56,479 ±  6,724   s/op

In all current RRX parsers, many IRIs are effectively parsed twice when "private Node iriResolve(String uriStr, ..." is called: resolving the IRI parses it once, and the ParserProfile then parses the resolved string again. So I added "public Node createURI(IRIx iriX, long line, long col);" to the ParserProfile, which simply uses the given IRIx instead of resolving it again. Then I changed iriResolve(...) so that, whenever the IRI has to be resolved and the parsed IRIx is therefore already available, this new ParserProfile method is called (a sketch of the change follows after the tables below). This makes quite a difference:

Benchmark                                       (param0_GraphUri)                                    (param1_ParserLang)        Mode  Cnt  Score   Error  Units
TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_SAX              avgt   15  1,016 ± 0,023   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX_sr          avgt   15  1,025 ± 0,020   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX2_sr         avgt   15  0,913 ± 0,020   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX2_sr_aalto   avgt   15  0,913 ± 0,125   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_SAX              avgt   15  0,128 ± 0,003   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX_sr          avgt   15  0,126 ± 0,010   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX2_sr         avgt   15  0,123 ± 0,002   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX2_sr_aalto   avgt   15  0,113 ± 0,003   s/op

Benchmark                                       (param0_GraphUri)  (param1_ParserLang)        Mode  Cnt   Score    Error  Units
TestXMLParser.parseXMLUsingBufferedInputStream  bsbm-5m.xml        RRX.RDFXML_SAX              avgt    3  11,637 ±  5,626   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  bsbm-5m.xml        RRX.RDFXML_StAX_sr          avgt    3  11,381 ±  0,302   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  bsbm-5m.xml        RRX.RDFXML_StAX2_sr         avgt    3  10,713 ±  3,734   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  bsbm-5m.xml        RRX.RDFXML_StAX2_sr_aalto   avgt    3   9,782 ±  1,532   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  citations.rdf      RRX.RDFXML_SAX              avgt    3  52,060 ± 17,449   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  citations.rdf      RRX.RDFXML_StAX_sr          avgt    3  52,267 ±  3,415   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  citations.rdf      RRX.RDFXML_StAX2_sr         avgt    3  52,982 ±  6,149   s/op
TestXMLParser.parseXMLUsingBufferedInputStream  citations.rdf      RRX.RDFXML_StAX2_sr_aalto   avgt    3  46,060 ±  6,228   s/op
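To make the change concrete, here is roughly the shape of it. This is a sketch, not the code from my branch: the field names (currentBase, parserProfile), the needsResolving(...) check and the parameter types are illustrative, and the real iriResolve(...) has more to it.

    // Added to org.apache.jena.riot.system.ParserProfile
    // (types: org.apache.jena.graph.Node, org.apache.jena.irix.IRIx):
    // the caller already holds a parsed IRIx, so the profile must not parse the string again.
    public Node createURI(IRIx iriX, long line, long col);

    // In the RRX parsers, iriResolve(...) then becomes something along these lines:
    private Node iriResolve(String uriStr, long line, long col) {
        if ( needsResolving(uriStr) ) {                         // illustrative check
            IRIx iri = currentBase.resolve(uriStr);             // resolving parses the IRI once ...
            return parserProfile.createURI(iri, line, col);     // ... so hand over the IRIx, not the string
        }
        return parserProfile.createURI(uriStr, line, col);      // unchanged path: the profile parses the string
    }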
Are there any tests for ParserProfile implementations or for their
integration with the parsers? I could not find any.

Next, I will try some caching of nodes and possibly try to integrate
https://github.com/afs/x4ld/tree/main/iri4ld. Do you have any other ideas
or suggestions on how to proceed?

Arne

On Tue, 24 Sept 2024 at 11:18, Andy Seaborne <a...@apache.org> wrote:
>
>
> On 23/09/2024 16:28, Arne Bernhardt wrote:
> > Hello,
> > I have been trying to speed up the parsing of RDF/XML in Apache Jena.
> >
> > The reason is that our customers are introducing new data formats.
> > This involves replacing small (<1MB) ASCII files originating from
> > punch cards with large (>40MB) RDF/XML formats. However, the
> > expectation is that the processing speed will not increase. ;-)
>
> Can you provide a copy of the test data?
>
> Performance can be sensitive to XML shape.
>
> Of the RRX parsers, StAX/EV is the slowest. It was written first and it
> is faster than ARP, but not by as much as I'd hoped.
>
> StAX/SR is a fairly simple conversion of the event code to use a
> StreamReader; it is faster than EV.
>
> SAX is fastest sometimes, and sometimes it's StAX/SR.
>
> TBH, once it was much faster than ARP, I didn't push on the performance
> as much.
>
> For testing I was using RDF from UniProt - citations.rdf has many long
> strings (literals of several lines of text). 3.6G uncompressed, 30e6
> triples.
>
> https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/citations.rdf.xz
>
> Figures as of yesterday:
> Files on an SSD.
> Running off the development tree (getting class files from jars makes a
> measurable difference).
>
> riot --syntax arp1 --time --count --sink citations.rdf
> citations.rdf : Triples = 30,358,427
> citations.rdf : 61.17 sec : 30,358,427 Triples : 496,287.90 per second
>
> riot --syntax rrxstaxev --time --count --sink citations.rdf
> citations.rdf : Triples = 30,358,427
> citations.rdf : 52.22 sec : 30,358,427 Triples : 581,411.99 per second
>
> riot --syntax rrxstaxsr --time --count --sink citations.rdf
> citations.rdf : Triples = 30,358,427
> citations.rdf : 31.86 sec : 30,358,427 Triples : 952,929.47 per second
>
> riot --syntax rrxsax --time --count --sink citations.rdf
> citations.rdf : Triples = 30,358,427
> citations.rdf : 32.32 sec : 30,358,427 Triples : 939,220.59 per second
>
> > The first thing I tried was to integrate other parsers than Xerces.
> > But when I added the dependencies, the unchanged riot/lang/rdfxml
> > implementations also started to use either woodstox or aalto.
> > The reason was in org.apache.jena.riot.lang.rdfxml.SysRRX, where
> > "XMLInputFactory.newInstance();" is used. To continue my comparisons,
> > I used "XMLInputFactory.newDefaultFactory()" instead.
> > --> So, when I bundle Jena with another application where aalto is a
> > dependency, aalto may be used?
>
> As I understand it, "XMLInputFactory.newInstance()" returns an instance
> from the system-configured parser and
> "XMLInputFactory.newDefaultFactory()" is always the built-in
> XMLInputFactoryImpl.
>
> Parsers are configured to be safe via setup in JenaXMLInput.
>
> Yes, other parsers can be used.
>
> Should Jena always use the built-in one?
>
> SysRRX could have a function that passes in the factory generator to
> allow an independent choice.
>
> That would allow Jena to provide RDF/XML with one parser, yet allow the
> app to have XML processing elsewhere with a different choice of XML
> parser.
>
> > If this is potentially dangerous, as mentioned in
> > https://issues.apache.org/jira/browse/JENA-2331, shouldn't we use
> > "XMLInputFactory.newDefaultFactory()" to ensure that no unknown
> > 3rd-party parser is used?
>
> JENA-2331 is fixed, I hope.
> The setup should work with woodstox.
> (the code JenaXMLInput.initXMLInputFactory pokes around and should not
> set woodstox-unsupported properties)
>
> Is there still a problem?
>
> > In my benchmarks, woodstox is a bit faster than the default, but with
> > larger files aalto is even faster. Unfortunately, aalto seems to be
> > almost inactive.
>
> What speeds were you getting?
>
> > Do you have any suggestions on fast xml parsers or is there any work
> > on a faster rdf/xml parser?
> >
> > Arne