The images didn't come through.
On 25/09/2024 17:16, Arne Bernhardt wrote:
Hi Andy,
First, the tiny fixes to support Aalto and Woodstox:
-- Missing image
Aalto has a bug where it initializes the default namespace prefix
with `null` rather than `""`
(https://github.com/FasterXML/aalto-xml/blob/master/src/main/java/com/fasterxml/aalto/in/NsBinding.java#L59),
but `getNamespaceURI(String prefix)` requires the prefix to be non-null
(https://github.com/FasterXML/aalto-xml/blob/master/src/main/java/com/fasterxml/aalto/in/FixedNsContext.java#L102).
So I used a dirty workaround.
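The actual workaround was in the image that didn't come through. One plausible shape of such a workaround, using only the JDK StAX API (the class and method names here are illustrative, not the real code), is to normalise the prefix before it reaches `getNamespaceURI`:

```java
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;

// Illustrative workaround: map the null default-namespace prefix that Aalto
// hands out back to "" before delegating, so getNamespaceURI(prefix) never
// sees null. Names are hypothetical, not the actual fix.
public class NullSafeNamespaceContext {

    // Normalise a possibly-null prefix to the empty string, matching the
    // StAX convention for the default namespace (DEFAULT_NS_PREFIX is "").
    public static String normalizePrefix(String prefix) {
        return prefix == null ? XMLConstants.DEFAULT_NS_PREFIX : prefix;
    }

    private final NamespaceContext delegate;

    public NullSafeNamespaceContext(NamespaceContext delegate) {
        this.delegate = delegate;
    }

    public String getNamespaceURI(String prefix) {
        return delegate.getNamespaceURI(normalizePrefix(prefix));
    }
}
```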
I extended initXMLInputFactory to get rid of the warnings for Aalto:
-- Missing image
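The extension itself was in the image that didn't come through. A rough, hypothetical sketch of the guarding idea, using only the JDK StAX API (the real property set-up lives in JenaXMLInput, and the names below are illustrative):

```java
import javax.xml.stream.XMLInputFactory;

public class XmlInputFactorySetup {

    // Illustrative sketch: only set a property if the factory supports it,
    // so implementations like Aalto that lack a property are skipped
    // silently instead of warning or throwing.
    public static void setIfSupported(XMLInputFactory factory, String name, Object value) {
        if (factory.isPropertySupported(name)) {
            factory.setProperty(name, value);
        }
    }

    public static XMLInputFactory safeFactory() {
        XMLInputFactory factory = XMLInputFactory.newDefaultFactory();
        // Standard hardening: no DTD processing, no external entities.
        setIfSupported(factory, XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        setIfSupported(factory, XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        return factory;
    }
}
```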
I tried to use the SAX parser implementation with Woodstox and Aalto,
but a lot of tests failed, and fixing this did not seem to be as easy
as it was with the StAX parser. So I stopped working on that.
As test data, I use the ENTSO-E conformity assessment test data for
the current CGMES version 2.4.15, downloadable as
"TestConfigurations_packageCASv2.0.zip" from
https://www.entsoe.eu/data/cim/cim-conformity-and-interoperability/.
(I also found them here:
https://github.com/derrickoswald/CIMSpark/blob/master/CIMReader/data/CGMES_v2.4.15_TestConfigurations_v4.0.3.zip)
The size of the "RealGrid" example is close to what we see for a
larger electricity grid in Europe, as long as they only export the
simpler bus-branch models. We have 35 such grids in Europe, each
operated by one transmission system operator (TSO). In the near future,
they will exchange much more detailed node-breaker models in order to
keep the grid stable. Data from all 35 grids is required to calculate
congestion and redispatch. The "RealGrid" only represents one
timestamp, and we normally have 24, each representing one hour.
So we need to read approx. "RealGrid" x 35 TSOs x 24 hours = 840 files.
Time for some parallelism of the data streams :-)
Do you have a pointer to exactly which data? Don't want to be looking at
the schema rather than the data!
Since we have many files to process and long-running processes, I do
not care about single-shot performance. What matters to me is the
performance after warm-up, which is convenient to benchmark with JMH.
To prevent the optimizer from optimizing my benchmark away, I read
into a GraphMem.
(RRX.RDFXML_StAX2_sr is Woodstox)
Benchmark: TestXMLParser.parseXMLUsingBufferedInputStream

(param0_GraphUri)                                   (param1_ParserLang)        Mode  Cnt  Score    Error  Units
CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_SAX             avgt   30  1,348 ±  0,008  s/op
CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX_ev         avgt   30  1,736 ±  0,009  s/op
CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX_sr         avgt   30  1,350 ±  0,011  s/op
CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX2_sr        avgt   30  1,266 ±  0,020  s/op
CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX2_sr_aalto  avgt   30  1,142 ±  0,009  s/op
CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_SAX             avgt   30  0,194 ±  0,003  s/op
CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX_ev         avgt   30  0,229 ±  0,003  s/op
CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX_sr         avgt   30  0,188 ±  0,003  s/op
CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX2_sr        avgt   30  0,182 ±  0,003  s/op
CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX2_sr_aalto  avgt   30  0,170 ±  0,003  s/op
We may process the different profiles (especially EQ, SSH and TP) in
parallel, but then the largest one is our bottleneck. We read each
profile into a separate graph.
The "EQ" profiles (equipment profiles, with "_EQ_" in the name) are
especially large compared to the other profiles, so I focus on them.
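The parallel reading of profiles into separate graphs could be sketched like this (a minimal sketch, not the real code: parseIntoGraph is a stand-in for the actual RDF/XML parse, and a String stands in for the graph):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: load each profile file into its own graph in
// parallel. Total wall time is then bounded by the largest profile,
// usually EQ.
public class ParallelProfileLoad {

    // Stand-in for the expensive parse step (real code would parse
    // RDF/XML into a fresh graph per profile).
    static String parseIntoGraph(String profileFile) {
        return "graph:" + profileFile;
    }

    public static Map<String, String> loadAll(List<String> profileFiles) {
        Map<String, String> graphs = new ConcurrentHashMap<>();
        profileFiles.parallelStream()
                .forEach(f -> graphs.put(f, parseIntoGraph(f)));
        return graphs;
    }
}
```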
Here are benchmarks with citations.rdf and bsbm-5m (converted to RDF/XML):
Benchmark: TestXMLParser.parseXMLUsingBufferedInputStream

(param0_GraphUri)  (param1_ParserLang)        Mode  Cnt   Score    Error  Units
bsbm-5m.xml        RRX.RDFXML_SAX             avgt    3  13,508 ±  6,208  s/op
bsbm-5m.xml        RRX.RDFXML_StAX_ev         avgt    3  16,123 ± 18,367  s/op
bsbm-5m.xml        RRX.RDFXML_StAX_sr         avgt    3  13,939 ±  2,823  s/op
bsbm-5m.xml        RRX.RDFXML_StAX2_sr        avgt    3  13,048 ±  2,160  s/op
bsbm-5m.xml        RRX.RDFXML_StAX2_sr_aalto  avgt    3  11,934 ±  2,509  s/op
citations.rdf      RRX.RDFXML_SAX             avgt    3  60,623 ± 47,619  s/op
citations.rdf      RRX.RDFXML_StAX_ev         avgt    3  94,292 ± 77,405  s/op
citations.rdf      RRX.RDFXML_StAX_sr         avgt    3  70,636 ± 48,869  s/op
citations.rdf      RRX.RDFXML_StAX2_sr        avgt    3  69,441 ± 12,561  s/op
citations.rdf      RRX.RDFXML_StAX2_sr_aalto  avgt    3  56,479 ±  6,724  s/op
In all current RRX parsers, many IRIs are parsed twice when "private
Node iriResolve(String uriStr, ..." is called.
So I added "public Node createURI(IRIx iriX, long line, long col);"
to the ParserProfile, which simply uses the given IRI instead of
resolving it again.
Then I changed iriResolve(...) so that, if the IRI needs to be resolved
and is therefore already known, the new ParserProfile function is called.
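The shape of that change can be sketched with hypothetical stand-in types (the real signatures use Jena's IRIx and Node; everything below is illustrative):

```java
// Hypothetical stand-ins for Jena's IRIx and Node, only to show the shape
// of the change: resolve the IRI string once, then build the node from the
// already-parsed form instead of re-parsing the string.
public class IriResolveSketch {

    public record Iri(String str) {}
    public record RdfNode(String uri) {}

    // Stand-in for the resolver: the expensive parse happens here, once.
    static Iri resolve(String uriStr) {
        return new Iri(uriStr);
    }

    // New path (analogous to createURI(IRIx, long, long)): use the given,
    // already-parsed IRI directly.
    static RdfNode createURI(Iri iri, long line, long col) {
        return new RdfNode(iri.str());
    }

    static RdfNode iriResolve(String uriStr, long line, long col) {
        Iri parsed = resolve(uriStr);      // parsed exactly once
        return createURI(parsed, line, col);
    }
}
```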
RRX processing is not exactly like Turtle et al.; it is more dependent
on IRI processing because of QNames.
A cache of IRI parsing could go in:
1. An IRIProvider - this is then shared across parser runs.
2. An IRIxResolver - that should then be shared between RRX and the
ParserProfile. It could be system-wide or per parser run.
3. A parser - or a subobject shared across the three RRX parsers
(your PR has multiple caches - I haven't got to the bottom of those yet).
4. The parser profile, in makeInternalURI.
5. A system-wide cache (via provider and/or resolver).
The downside of any shared cache is the overhead of making it thread-safe.
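As a minimal sketch of such a shared cache (illustrative only: a plain String stands in for the parsed IRIx/Node, a real cache would bound its size, and the thread-safety cost shows up as the synchronisation inside computeIfAbsent):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative shared IRI-parse cache: parse each IRI string once and
// reuse the result across lookups. ConcurrentHashMap makes it safe to
// share between parser runs, at the cost of internal synchronisation.
public class IriParseCache {

    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Stand-in for the expensive parse/resolve step.
    private String parse(String iriStr) {
        return iriStr.trim();
    }

    public String lookup(String iriStr) {
        return cache.computeIfAbsent(iriStr, this::parse);
    }

    public int size() {
        return cache.size();
    }
}
```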
This makes quite a difference:
Tried it - I see a good improvement.
It is good to see the JMH figures, but can we also discuss real usage?
A big enough file that command startup overhead is negligible, and riot figures.
citations.rdf?
(BSBM has many large literals, making it "content-centric", which,
incidentally, an XML parser should do very well on.)
What are you getting on your hardware for Jena 5.1.0 and your branch?
Benchmark: TestXMLParser.parseXMLUsingBufferedInputStream

(param0_GraphUri)                                   (param1_ParserLang)        Mode  Cnt  Score    Error  Units
CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_SAX             avgt   15  1,016 ±  0,023  s/op
CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX_sr         avgt   15  1,025 ±  0,020  s/op
CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX2_sr        avgt   15  0,913 ±  0,020  s/op
CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX2_sr_aalto  avgt   15  0,913 ±  0,125  s/op
CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_SAX             avgt   15  0,128 ±  0,003  s/op
CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX_sr         avgt   15  0,126 ±  0,010  s/op
CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX2_sr        avgt   15  0,123 ±  0,002  s/op
CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX2_sr_aalto  avgt   15  0,113 ±  0,003  s/op
Benchmark: TestXMLParser.parseXMLUsingBufferedInputStream

(param0_GraphUri)  (param1_ParserLang)        Mode  Cnt   Score    Error  Units
bsbm-5m.xml        RRX.RDFXML_SAX             avgt    3  11,637 ±  5,626  s/op
bsbm-5m.xml        RRX.RDFXML_StAX_sr         avgt    3  11,381 ±  0,302  s/op
bsbm-5m.xml        RRX.RDFXML_StAX2_sr        avgt    3  10,713 ±  3,734  s/op
bsbm-5m.xml        RRX.RDFXML_StAX2_sr_aalto  avgt    3   9,782 ±  1,532  s/op
citations.rdf      RRX.RDFXML_SAX             avgt    3  52,060 ± 17,449  s/op
citations.rdf      RRX.RDFXML_StAX_sr         avgt    3  52,267 ±  3,415  s/op
citations.rdf      RRX.RDFXML_StAX2_sr        avgt    3  52,982 ±  6,149  s/op
citations.rdf      RRX.RDFXML_StAX2_sr_aalto  avgt    3  46,060 ±  6,228  s/op
Are there any tests for ParserProfile implementations or the
integration with the parsers? I could not find any.
Not as tests of ParserProfile directly - there is TestFactoryRDF;
FactoryRDF is doing most of the work.
Every parser has one or more eval test suites (Jena and also W3C) that
use ParserProfile extensively, because that is the purpose of ParserProfile.
Next, I will try some caching of nodes and possibly try to integrate
https://github.com/afs/x4ld/tree/main/iri4ld.
Do you have any other ideas or suggestions on how to proceed?
I'll write about iri4ld soon.
The main reason for iri4ld is the (un)maintainability of jena-iri, but
it is also much more lightweight - maybe to the point where caches have
little or even negative effect.
Arne
Am Di., 24. Sept. 2024 um 11:18 Uhr schrieb Andy Seaborne
<a...@apache.org>:
On 23/09/2024 16:28, Arne Bernhardt wrote:
> Hello,
> I have been trying to speed up the parsing of RDF/XML in Apache Jena.
>
> The reason is that our customers are introducing new data formats. This
> involves replacing small (<1MB) ASCII files originating from punch cards
> with large (>40MB) RDF/XML formats. However, the expectation is that the
> processing speed will not increase. ;-)
Can you provide a copy of the test data?
Performance can be sensitive to XML shape.
Of the RRX parsers, StAX/EV is the slowest. It was written first, and it
is faster than ARP, but not by as much as I'd hoped.
StAX/SR is a fairly simple conversion of the event code to use
StreamReader and is faster than EV.
SAX is fastest sometimes, and sometimes it's StAX/SR.
TBH, once it was much faster than ARP, I didn't push on the performance
as much.
For testing I was using RDF from UniProt - citations.rdf has many long
strings (literals of several lines of text). 3.6G uncompressed. 30e6
triples.
https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/citations.rdf.xz
Figures as of yesterday:
Files on an SSD.
Running off the development tree (getting class files from jars makes a
measurable difference).
riot --syntax arp1 --time --count --sink citations.rdf
citations.rdf : Triples = 30,358,427
citations.rdf : 61.17 sec : 30,358,427 Triples : 496,287.90 per second

riot --syntax rrxstaxev --time --count --sink citations.rdf
citations.rdf : Triples = 30,358,427
citations.rdf : 52.22 sec : 30,358,427 Triples : 581,411.99 per second

riot --syntax rrxstaxsr --time --count --sink citations.rdf
citations.rdf : Triples = 30,358,427
citations.rdf : 31.86 sec : 30,358,427 Triples : 952,929.47 per second

riot --syntax rrxsax --time --count --sink citations.rdf
citations.rdf : Triples = 30,358,427
citations.rdf : 32.32 sec : 30,358,427 Triples : 939,220.59 per second
> The first thing I tried was to integrate other parsers than Xerces.
> But when I added the dependencies, also the unchanged riot/lang/rdfxml
> implementations started to use either woodstox or aalto.
> The reason was in org.apache.jena.riot.lang.rdfxml.SysRRX, where
> "XMLInputFactory.newInstance();" is used. To continue my comparisons,
> I used "XMLInputFactory.newDefaultFactory()" instead.
> --> So, when I bundle Jena with another application where aalto is a
> dependency, aalto may be used?
As I understand it, "XMLInputFactory.newInstance()" returns an instance
from the system-configured parser, and
"XMLInputFactory.newDefaultFactory()" is always the built-in
XMLInputFactoryImpl.
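That difference can be demonstrated with a small JDK-only sketch (the class here is illustrative; which class newInstance() returns depends on the classpath and system properties):

```java
import javax.xml.stream.XMLInputFactory;

// Illustrative comparison: newInstance() goes through the service-provider
// lookup (system property, ServiceLoader, then the JDK default), so adding
// Woodstox or Aalto to the classpath changes which parser it returns.
// newDefaultFactory() always returns the JDK's built-in implementation.
public class FactoryChoice {

    public static String discoveredFactoryClass() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    public static String builtInFactoryClass() {
        return XMLInputFactory.newDefaultFactory().getClass().getName();
    }
}
```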
Parsers are configured to be safe via setup in JenaXMLInput.
Yes, other parsers can be used.
Should Jena always use the built-in one?
SysRRX could have a function that passes in the factory generator to
allow an independent choice.
That would allow Jena to provide RDF/XML with one parser yet allow the
app to have XML processing elsewhere with a different choice of XML
parser.
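A minimal sketch of that idea (the names are hypothetical, not the actual SysRRX API): let the application inject the factory supplier Jena uses for RDF/XML, independent of whatever StAX parser the rest of the app picks up via service discovery.

```java
import java.util.function.Supplier;
import javax.xml.stream.XMLInputFactory;

// Hypothetical sketch of an injectable factory source for RDF/XML parsing.
public class RdfXmlFactorySource {

    // Default to the built-in JDK parser; the application may override.
    private static volatile Supplier<XMLInputFactory> supplier =
            XMLInputFactory::newDefaultFactory;

    public static void setFactorySupplier(Supplier<XMLInputFactory> s) {
        supplier = s;
    }

    // RDF/XML parsing would obtain its factory here, unaffected by the
    // classpath-driven choice made elsewhere in the application.
    public static XMLInputFactory createXMLInputFactory() {
        return supplier.get();
    }
}
```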
> If this is potentially dangerous, as mentioned in
> https://issues.apache.org/jira/browse/JENA-2331, shouldn't we use
> "XMLInputFactory.newDefaultFactory()" to ensure that no unknown
> 3rd-party parser is used?
JENA-2331 is fixed, I hope.
The setup should work with woodstox.
(The code JenaXMLInput.initXMLInputFactory pokes around and should not
set woodstox-unsupported properties.)
Is there still a problem?
Is there still a problem?
> In my benchmarks, woodstox is a bit faster than the default, but with
> larger files aalto is even faster. Unfortunately aalto seems to be
> almost inactive.
What speeds were you getting?
> Do you have any suggestions on fast xml parsers, or is there any work
> on a faster rdf/xml parser?
>
> Arne
>