Re: Fast RDF/XML parsing - woodstox, aalto and alternatives

Arne Bernhardt Sun, 29 Sep 2024 12:11:24 -0700

On Sun, 29 Sept 2024 at 17:27, Andy Seaborne <a...@apache.org> wrote:


> The images didn't come through.
>
>
>
> On 25/09/2024 17:16, Arne Bernhardt wrote:
> > Hi Andy,
> >
> > first to the tiny fixes to support Aalto and Woodstox:
>
> -- Missing image
>

I uploaded the image to Google Drive:
https://drive.google.com/file/d/1hHKSkEVOhXstGI8ZgGoWdSvXgiUt5-u2/view?usp=drive_link


>
> > Aalto has a bug, where it initializes the default namespace prefix
> > with `null` rather than `""`
> > (https://github.com/FasterXML/aalto-xml/blob/master/src/main/java/com/fasterxml/aalto/in/NsBinding.java#L59),
> > but `getNamespaceURI(String prefix)` requires prefix to be not null
> > (https://github.com/FasterXML/aalto-xml/blob/master/src/main/java/com/fasterxml/aalto/in/FixedNsContext.java#L102).
> > So I used the dirty workaround.
> >
>

Meanwhile, I created an issue for Aalto
(https://github.com/FasterXML/aalto-xml/issues/97) and a PR
(https://github.com/FasterXML/aalto-xml/pull/98).
Unfortunately, there have not been any commits since June; I hope the
project is still active.
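
For anyone following along, here is a minimal sketch of what a workaround of
this kind can look like, assuming the only problem is a `null` prefix
reaching `getNamespaceURI`. The names `NsPrefix`, `normalize` and `lookup`
are made up for illustration; this is not the code from the Jena PR or from
the Aalto PR.

```java
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;

/** Hypothetical helper for illustration; not part of Jena or Aalto. */
final class NsPrefix {
    private NsPrefix() {}

    /** Map a null prefix to the default namespace prefix "". */
    static String normalize(String prefix) {
        return prefix == null ? XMLConstants.DEFAULT_NS_PREFIX : prefix;
    }

    /** Look up a namespace URI without ever passing null as the prefix. */
    static String lookup(NamespaceContext context, String prefix) {
        return context.getNamespaceURI(normalize(prefix));
    }
}
```

The point is only that the caller normalizes the prefix, so the behaviour
does not depend on which StAX implementation is on the classpath.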


> > I extended initXmlInputFactory to get rid of the warnings for Aalto:
>
> -- Missing image
> > I tried to use the SAX-parser implementation with woodstox and aalto
> > but there were a lot of tests not running and fixing this did not seem
> > to be as easy as it was with the STAX-parser. So I stopped working on
> > that.
> >
> >
> > As test data, I use the ENTSO-E conformity assessment test data for
> > the current CGMES version 2.4.15, which is downloadable as
> > "TestConfigurations_packageCASv2.0.zip" from
> > https://www.entsoe.eu/data/cim/cim-conformity-and-interoperability/.
> > (I also found it here:
> > https://github.com/derrickoswald/CIMSpark/blob/master/CIMReader/data/CGMES_v2.4.15_TestConfigurations_v4.0.3.zip)
>
>
>
> > The size of the "RealGrid" example is close to what we see for a
> > larger electricity grid in Europe, as long as they only export the
> > simpler bus-branch models. We have 35 such grids in Europe, each
> > operated by one transport system operator (TSO). In the near future,
> > they will exchange much more detailed node-breaker models in order to
> > keep the grid stable. Data from all 35 grids is required to calculate
> > congestion and redispatch. The "RealGrid" only represents one
> > timestamp and we normally have 24, each representing one hour.
> > So we need to read approx. "RealGrid" x 35 TSOs x 24 hours.
> Time for some parallelism of the data streams :-)
>
> Do you have a pointer to exactly which data? Don't want to be looking at
> the schema rather than the data!
>

If I want to keep it generic, I first need to load all the RDF schema graphs
to determine the datatypes of the literals, which in CIMXML are all
provided as string literals.
The schemas are downloadable from
https://www.entsoe.eu/data/cim/cim-for-grid-models-exchange/#:~:text=Common%20Grid%20Model%20Exchange%20Standard
as
https://www.entsoe.eu/Documents/CIM_documents/Grid_Model_CIM/CGMES2415_Components_2020.zip

In the common case, I would have to load these RDF schema files from
\CGMES2415_Components_2020\RDFS\ :
- FileHeader.rdf (this schema is present in all files)
- EquipmentProfileCoreOperationRDFSAugmented-v2_4_15-4Sep2020.rdf (for
EQ-profiles)
- SteadyStateHypothesisProfileRDFSAugmented-v2_4_15-4Sep2020.rdf (for
SSH-profiles)
- TopologyProfileRDFSAugmented-v2_4_15-4Sep2020.rdf (for TP-profiles)
- StateVariableProfileRDFSAugmented-v2_4_15-4Sep2020.rdf (for SV-profiles)

However, I might fall back to extracting the datatypes once and generating
static code for them.
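
A rough sketch of what such generated static code could look like, assuming
a plain lookup from property IRI to datatype IRI is enough. The class name
`CimDatatypes` and the two table entries are illustrative assumptions; they
were not extracted from the RDFS files, and the mapping to XSD types is my
own simplification.

```java
import java.util.Map;

/** Hypothetical generated lookup table: CIM property IRI -> datatype IRI. */
final class CimDatatypes {
    private CimDatatypes() {}

    static final String CIM = "http://iec.ch/TC57/2013/CIM-schema-cim16#";
    static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    // Illustrative entries only; a real table would be generated from the schema files.
    static final Map<String, String> DATATYPE_BY_PROPERTY = Map.of(
            CIM + "ACLineSegment.r", XSD + "float",
            CIM + "Switch.open", XSD + "boolean");

    /** Datatype for a property; unknown properties stay plain strings. */
    static String datatypeOf(String propertyIri) {
        return DATATYPE_BY_PROPERTY.getOrDefault(propertyIri, XSD + "string");
    }
}
```

With such a table the schema graphs do not have to be parsed at start-up,
at the price of regenerating the code whenever the profiles change.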

Then I have the instance data, which must be read. From the example
CGMES_v2.4.15_TestConfigurations_v4.0.3.zip this would be
/RealGrid/CGMES_v2.4.15_RealGridTestConfiguration_v2.zip, and in there:
Most important:
- CGMES_v2.4.15_RealGridTestConfiguration_EQ_v2.xml (describes the
equipment of the grid, like wires, transformers, power plants etc.)
- CGMES_v2.4.15_RealGridTestConfiguration_SSH_v2.xml (describes the
expected production and consumption as well as the state of the components,
like switch positions)
Needed less often, but often enough:
- CGMES_v2.4.15_RealGridTestConfiguration_SV_v2.xml (describes the result
of a power flow calculation with all voltages and reactive power for the
power plants)
- CGMES_v2.4.15_RealGridTestConfiguration_TP_v2.xml (describes the current
topology of the grid, which is a reduced perspective to all currently
electrically connected elements)


> > Since we have many files to process and  long running processes, I do
> > not care about the single-shot performance. The performance after warm
> > up, which is nice to benchmark with JMH, is important to me. To
> > prevent the optimizer from messing around with my benchmark, I read
> > into a GraphMem.
> > (RRX.RDFXML_StAX2_s is woodstox)
> > Benchmark                                       (param0_GraphUri)                                   (param1_ParserLang)        Mode  Cnt  Score   Error  Units
> > TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_SAX             avgt   30  1,348 ± 0,008  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX_ev         avgt   30  1,736 ± 0,009  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX_sr         avgt   30  1,350 ± 0,011  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX2_sr        avgt   30  1,266 ± 0,020  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX2_sr_aalto  avgt   30  1,142 ± 0,009  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_SAX             avgt   30  0,194 ± 0,003  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX_ev         avgt   30  0,229 ± 0,003  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX_sr         avgt   30  0,188 ± 0,003  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX2_sr        avgt   30  0,182 ± 0,003  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX2_sr_aalto  avgt   30  0,170 ± 0,003  s/op
> >
> > We may process the different profiles (especially EQ, SSH and TP) in
> > parallel, but then the largest one is our bottleneck. We read each
> > profile into a separate graph.
> > Especially the "EQ"-profiles (equipment profile with "_EQ_" in the
> > name) are quite large compared to the other profiles. So I focus
> > especially on this profile.
> >
> > Here benchmarks with citations.rdf and bsbm-5m (converted into rdf/xml):
> > Benchmark                                       (param0_GraphUri)  (param1_ParserLang)        Mode  Cnt   Score    Error  Units
> > TestXMLParser.parseXMLUsingBufferedInputStream  bsbm-5m.xml        RRX.RDFXML_SAX             avgt    3  13,508 ±  6,208  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  bsbm-5m.xml        RRX.RDFXML_StAX_ev         avgt    3  16,123 ± 18,367  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  bsbm-5m.xml        RRX.RDFXML_StAX_sr         avgt    3  13,939 ±  2,823  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  bsbm-5m.xml        RRX.RDFXML_StAX2_sr        avgt    3  13,048 ±  2,160  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  bsbm-5m.xml        RRX.RDFXML_StAX2_sr_aalto  avgt    3  11,934 ±  2,509  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  citations.rdf      RRX.RDFXML_SAX             avgt    3  60,623 ± 47,619  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  citations.rdf      RRX.RDFXML_StAX_ev         avgt    3  94,292 ± 77,405  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  citations.rdf      RRX.RDFXML_StAX_sr         avgt    3  70,636 ± 48,869  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  citations.rdf      RRX.RDFXML_StAX2_sr        avgt    3  69,441 ± 12,561  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  citations.rdf      RRX.RDFXML_StAX2_sr_aalto  avgt    3  56,479 ±  6,724  s/op
> >
> > In all current RRX-parsers, many IRIs are parsed twice, when "private
> > Node iriResolve(String uriStr, ..." is called.
> > So I added  "public Node createURI(IRIx iriX, long line, long col);"
> > to the ParserProfile, which simply uses the given IRI instead of
> > resolving it again.
> > Then I changed iriResolve(...) so that if the IRI needs to be resolved
> > and is therefore known, the new function of the ParserProfile is called.
>
> RRX processing is not exactly like Turtle et al.; it is more dependent
> on IRI processing because of qnames.
>
> A cache of IRI parsing could go in
>
>  1. An IRIProvider - this is then shared across parser runs.
>  2. In IRIxResolver - that should then be shared between RRX and
>     ParserProfile. It could be system wide or per parser run.
>  3. In a parser - or in a subobject shared across the three RRX parsers
>     (your PR has multiple caches - haven't got to the bottom of those yet)
>  4. In the parser profile in makeInternalURI
>  5. As a system wide cache (via provider and/or resolver).
>
> Downside of all shared caches would be the cache overhead to be
> thread-safe.
>

A good idea, thank you.
The overhead of Caffeine might be worth it when I have 35 datasets that all
share most of their IRIs.
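
To make the trade-off concrete, a minimal sketch of a cache shared across
parser runs. It deliberately uses `ConcurrentHashMap` from the JDK instead
of Caffeine, so it is unbounded and never evicts; a Caffeine cache would add
size limits. `SharedIriCache` is a made-up name, and the `parser` function
is a stand-in for the real IRI parsing step, not Jena's API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

/** Hypothetical thread-safe cache of parsed IRIs, keyed by the IRI string. */
final class SharedIriCache<T> {
    private final ConcurrentHashMap<String, T> cache = new ConcurrentHashMap<>();
    private final Function<String, T> parser;   // stand-in for the real IRI parser
    private final AtomicLong misses = new AtomicLong();

    SharedIriCache(Function<String, T> parser) {
        this.parser = parser;
    }

    /** Parse each distinct IRI string at most once, even with concurrent callers. */
    T get(String iriString) {
        return cache.computeIfAbsent(iriString, s -> {
            misses.incrementAndGet();
            return parser.apply(s);
        });
    }

    /** Number of times the underlying parser was actually invoked. */
    long misses() {
        return misses.get();
    }
}
```

The thread-safety cost Andy mentions sits in `computeIfAbsent`; whether it
pays off depends on how expensive the parsing step is compared to the map
lookup.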


> > This makes quite a difference:
>
>
> Tried it - I see a good improvement.
>
> It is good to see the JMH results, but can we also discuss real usage?
>
> A big enough file that cmd overhead is negligible and riot figures.
> citation.rdf?
> (BSBM has many large literals making it "content-centric" which
> incidentally, an XML should do very well on.)
>
> What are you getting on your hardware for Jena 5.10 and your branch?
>
>
On my notebook with 13th Gen Intel(R) Core(TM) i9-13950HX   2.20 GHz and 64
GB of RAM, I get:

Benchmark                      (param0_GraphUri)                                               (param1_ParserLang)  Mode  Cnt  Score   Error  Units
TestXMLParser.parseXML         EquipmentProfileCoreRDFSAugmented-v2_4_15-4Sep2020.rdf          RRX.RDFXML_SAX       avgt   15  0,037 ± 0,008  s/op
TestXMLParser.parseXML         EquipmentProfileCoreRDFSAugmented-v2_4_15-4Sep2020.rdf          RRX.RDFXML_StAX_sr   avgt   15  0,038 ± 0,008  s/op
TestXMLParser.parseXML         SteadyStateHypothesisProfileRDFSAugmented-v2_4_15-4Sep2020.rdf  RRX.RDFXML_SAX       avgt   15  0,018 ± 0,005  s/op
TestXMLParser.parseXML         SteadyStateHypothesisProfileRDFSAugmented-v2_4_15-4Sep2020.rdf  RRX.RDFXML_StAX_sr   avgt   15  0,016 ± 0,006  s/op
TestXMLParser.parseXML         StateVariableProfileRDFSAugmented-v2_4_15-4Sep2020.rdf          RRX.RDFXML_SAX       avgt   15  0,014 ± 0,003  s/op
TestXMLParser.parseXML         StateVariableProfileRDFSAugmented-v2_4_15-4Sep2020.rdf          RRX.RDFXML_StAX_sr   avgt   15  0,013 ± 0,006  s/op
TestXMLParser.parseXML         TopologyProfileRDFSAugmented-v2_4_15-4Sep2020.rdf               RRX.RDFXML_SAX       avgt   15  0,016 ± 0,005  s/op
TestXMLParser.parseXML         TopologyProfileRDFSAugmented-v2_4_15-4Sep2020.rdf               RRX.RDFXML_StAX_sr   avgt   15  0,015 ± 0,004  s/op
TestXMLParser.parseXML         CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml               RRX.RDFXML_SAX       avgt   15  0,956 ± 0,007  s/op
TestXMLParser.parseXML         CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml               RRX.RDFXML_StAX_sr   avgt   15  0,962 ± 0,034  s/op
TestXMLParser.parseXML         CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml              RRX.RDFXML_SAX       avgt   15  0,143 ± 0,005  s/op
TestXMLParser.parseXML         CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml              RRX.RDFXML_StAX_sr   avgt   15  0,142 ± 0,005  s/op
TestXMLParser.parseXML         CGMES_v2.4.15_RealGridTestConfiguration_TP_V2.xml               RRX.RDFXML_SAX       avgt   15  0,149 ± 0,010  s/op
TestXMLParser.parseXML         CGMES_v2.4.15_RealGridTestConfiguration_TP_V2.xml               RRX.RDFXML_StAX_sr   avgt   15  0,146 ± 0,005  s/op
TestXMLParser.parseXML         CGMES_v2.4.15_RealGridTestConfiguration_SV_V2.xml               RRX.RDFXML_SAX       avgt   15  0,122 ± 0,005  s/op
TestXMLParser.parseXML         CGMES_v2.4.15_RealGridTestConfiguration_SV_V2.xml               RRX.RDFXML_StAX_sr   avgt   15  0,123 ± 0,005  s/op
TestXMLParser.parseXMLJena510  EquipmentProfileCoreRDFSAugmented-v2_4_15-4Sep2020.rdf          RRX.RDFXML_SAX       avgt   15  0,048 ± 0,006  s/op
TestXMLParser.parseXMLJena510  EquipmentProfileCoreRDFSAugmented-v2_4_15-4Sep2020.rdf          RRX.RDFXML_StAX_sr   avgt   15  0,050 ± 0,009  s/op
TestXMLParser.parseXMLJena510  SteadyStateHypothesisProfileRDFSAugmented-v2_4_15-4Sep2020.rdf  RRX.RDFXML_SAX       avgt   15  0,017 ± 0,005  s/op
TestXMLParser.parseXMLJena510  SteadyStateHypothesisProfileRDFSAugmented-v2_4_15-4Sep2020.rdf  RRX.RDFXML_StAX_sr   avgt   15  0,016 ± 0,006  s/op
TestXMLParser.parseXMLJena510  StateVariableProfileRDFSAugmented-v2_4_15-4Sep2020.rdf          RRX.RDFXML_SAX       avgt   15  0,014 ± 0,007  s/op
TestXMLParser.parseXMLJena510  StateVariableProfileRDFSAugmented-v2_4_15-4Sep2020.rdf          RRX.RDFXML_StAX_sr   avgt   15  0,017 ± 0,003  s/op
TestXMLParser.parseXMLJena510  TopologyProfileRDFSAugmented-v2_4_15-4Sep2020.rdf               RRX.RDFXML_SAX       avgt   15  0,016 ± 0,006  s/op
TestXMLParser.parseXMLJena510  TopologyProfileRDFSAugmented-v2_4_15-4Sep2020.rdf               RRX.RDFXML_StAX_sr   avgt   15  0,014 ± 0,003  s/op
TestXMLParser.parseXMLJena510  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml               RRX.RDFXML_SAX       avgt   15  1,319 ± 0,012  s/op
TestXMLParser.parseXMLJena510  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml               RRX.RDFXML_StAX_sr   avgt   15  1,350 ± 0,029  s/op
TestXMLParser.parseXMLJena510  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml              RRX.RDFXML_SAX       avgt   15  0,191 ± 0,005  s/op
TestXMLParser.parseXMLJena510  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml              RRX.RDFXML_StAX_sr   avgt   15  0,190 ± 0,003  s/op
TestXMLParser.parseXMLJena510  CGMES_v2.4.15_RealGridTestConfiguration_TP_V2.xml               RRX.RDFXML_SAX       avgt   15  0,243 ± 0,006  s/op
TestXMLParser.parseXMLJena510  CGMES_v2.4.15_RealGridTestConfiguration_TP_V2.xml               RRX.RDFXML_StAX_sr   avgt   15  0,241 ± 0,006  s/op
TestXMLParser.parseXMLJena510  CGMES_v2.4.15_RealGridTestConfiguration_SV_V2.xml               RRX.RDFXML_SAX       avgt   15  0,167 ± 0,004  s/op
TestXMLParser.parseXMLJena510  CGMES_v2.4.15_RealGridTestConfiguration_SV_V2.xml               RRX.RDFXML_StAX_sr   avgt   15  0,166 ± 0,005  s/op

Even when I read in parallel, I have to wait for the slowest graph/profile
to be parsed, which is the EQ profile: 1.319 seconds with Jena 5.1 versus
0.956 seconds with the PR <https://github.com/apache/jena/pull/2744>.
That difference is not negligible to me.
Besides, the measured time is after warm-up and after initializing the
HttpEnv :-) and in some scenarios I can't avoid a cold start.
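
As an illustration of why the largest profile sets the wall-clock time, a
minimal sketch of reading each profile into its own result in parallel.
`ParallelProfiles` and `parseOne` are hypothetical names; `parseOne` stands
in for the actual parser call that reads one file into one graph, which is
not shown here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical driver: one task per profile file, one result per file. */
final class ParallelProfiles {
    private ParallelProfiles() {}

    static <G> Map<String, G> parseAll(List<String> files,
                                       Function<String, G> parseOne,
                                       int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every file first so that all parses run concurrently.
            Map<String, CompletableFuture<G>> pending = new LinkedHashMap<>();
            for (String file : files) {
                pending.put(file, CompletableFuture.supplyAsync(() -> parseOne.apply(file), pool));
            }
            // join() blocks until each parse is done; the overall wait is
            // therefore bounded below by the slowest file (here: EQ).
            Map<String, G> graphs = new LinkedHashMap<>();
            pending.forEach((file, future) -> graphs.put(file, future.join()));
            return graphs;
        } finally {
            pool.shutdown();
        }
    }
}
```

With EQ, SSH, TP and SV submitted together, the call returns roughly when
the EQ parse finishes, so only a faster EQ parse shortens the total.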

I was recently at the LF Energy summit, where a speaker complained that in
PowSyBl <https://github.com/powsybl/>, which by the way uses RDF4J,
it takes about 7-10 seconds to import a single CGMES dataset for just one
timestamp, whereas other formats only take 1-2 seconds.
One of my goals is to strengthen the use of CIMXML as an RDF and graph
format instead of an XML format.
So I thought it would be nice to see whether it could be done much faster
in PowSyBl with the approach we are currently developing at work.
That is how I came to profile where it takes so long to read the files. I
simply started at the beginning of the process.

Many of the processes used to operate the merged European grid have a time
window of around 15 minutes to a maximum of 1 hour.
(Of course, individual grid operators have to react much more quickly to
situations in their grid. That's why they have these huge screens in their
control rooms.)
Calculations that require optimisation currently often take around 20
minutes on the merged European grid.
So if all 35 individual grids (x 24 hours) arrive and we spend more than 5
minutes reading, merging and preparing the data before the operator can
even see the status of the grids and start the calculations, then that is
critical time.
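
A back-of-the-envelope calculation for that 5-minute window, as a small
sketch. It rests on loud assumptions: every EQ file costs as much as the
RealGrid EQ file, only EQ parsing is counted (no merging or preparation),
work splits evenly across workers, and the whole 300 seconds are available
for parsing. `ReadBudget` is a made-up name.

```java
/** Hypothetical helper for a rough parse-time budget. */
final class ReadBudget {
    private ReadBudget() {}

    /** Total parse time if all files are read one after another. */
    static double sequentialSeconds(int files, double secondsPerFile) {
        return files * secondsPerFile;
    }

    /** Smallest number of evenly loaded workers that stays within the budget. */
    static int minWorkers(int files, double secondsPerFile, double budgetSeconds) {
        return (int) Math.ceil(sequentialSeconds(files, secondsPerFile) / budgetSeconds);
    }

    public static void main(String[] args) {
        int files = 35 * 24; // 35 grids x 24 hourly timestamps = 840 EQ files
        System.out.printf("PR:       %.0f s sequential, %d workers%n",
                sequentialSeconds(files, 0.956), minWorkers(files, 0.956, 300));
        System.out.printf("Jena 5.1: %.0f s sequential, %d workers%n",
                sequentialSeconds(files, 1.319), minWorkers(files, 1.319, 300));
    }
}
```

Under these assumptions, 840 EQ files take about 803 s sequentially with
the PR and about 1108 s with Jena 5.1, so at least 3 versus 4 parallel
workers are needed just to parse within 300 s.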

Real-time control room software is one of the main reasons I am a bit
obsessed with performance in Apache Jena.
We have proven that RDF with Apache Jena works well for a big cloud-native
application, optimizing redispatch on the European grid.
Now one of our customers, a transport grid operator, wants to renew their
SCADA and control room software. Many parts of that software are now based
on a generic model-based approach using RDF. The control room software
needs to display the data almost in real-time.
That is why I constantly come up with PoCs that test the limits of what is
possible with Apache Jena.



> >
> > Benchmark                                       (param0_GraphUri)                                   (param1_ParserLang)        Mode  Cnt  Score   Error  Units
> > TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_SAX             avgt   15  1,016 ± 0,023  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX_sr         avgt   15  1,025 ± 0,020  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX2_sr        avgt   15  0,913 ± 0,020  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX2_sr_aalto  avgt   15  0,913 ± 0,125  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_SAX             avgt   15  0,128 ± 0,003  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX_sr         avgt   15  0,126 ± 0,010  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX2_sr        avgt   15  0,123 ± 0,002  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml  RRX.RDFXML_StAX2_sr_aalto  avgt   15  0,113 ± 0,003  s/op
> >
> > Benchmark                                       (param0_GraphUri)  (param1_ParserLang)        Mode  Cnt   Score    Error  Units
> > TestXMLParser.parseXMLUsingBufferedInputStream  bsbm-5m.xml        RRX.RDFXML_SAX             avgt    3  11,637 ±  5,626  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  bsbm-5m.xml        RRX.RDFXML_StAX_sr         avgt    3  11,381 ±  0,302  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  bsbm-5m.xml        RRX.RDFXML_StAX2_sr        avgt    3  10,713 ±  3,734  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  bsbm-5m.xml        RRX.RDFXML_StAX2_sr_aalto  avgt    3   9,782 ±  1,532  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  citations.rdf      RRX.RDFXML_SAX             avgt    3  52,060 ± 17,449  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  citations.rdf      RRX.RDFXML_StAX_sr         avgt    3  52,267 ±  3,415  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  citations.rdf      RRX.RDFXML_StAX2_sr        avgt    3  52,982 ±  6,149  s/op
> > TestXMLParser.parseXMLUsingBufferedInputStream  citations.rdf      RRX.RDFXML_StAX2_sr_aalto  avgt    3  46,060 ±  6,228  s/op
> >
> > Are there any tests for ParserProfile implementations or the
> > integration with the parsers? I could not find any.
> Not as test ParserProfile directly - there is TestFactoryRDF, FactoryRDF
> is doing most of the work.
> Every parser has one or more eval test suites (Jena and also W3C) that
> uses ParserProfile extensively because that's the purpose of ParserProfile.
>
> > Next, I will try some caching of nodes and possibly try to integrate
> > https://github.com/afs/x4ld/tree/main/iri4ld.
> > Do you have any other ideas or suggestions on how to proceed?
> I'll write about iri4ld soon.
>
> The main reason for iri4ld is (un)maintainability of jena-iri but it is
> much more lightweight maybe to the point where caches have little or
> negative effect.
>

I am looking forward to moving to iri4ld. Maybe we can remove the cache
then.
That's one reason why I added the JMH benchmark test as part of the PR.


> >
> >   Arne
> >
> > On Tue, 24 Sept 2024 at 11:18, Andy Seaborne
> > <a...@apache.org> wrote:
> >
> >
> >
> >     On 23/09/2024 16:28, Arne Bernhardt wrote:
> >     > Hello,
> >     > I have been trying to speed up the parsing of RDF/XML in Apache
> >     > Jena.
> >     >
> >     > The reason is that our customers are introducing new data
> >     > formats. This involves replacing small (<1MB) ASCII files
> >     > originating from punch cards with large (>40MB) RDF/XML formats.
> >     > However, the expectation is that the processing speed will not
> >     > increase. ;-)
> >
> >     Can you provide a copy of the test data?
> >
> >     Performance can be sensitive to XML shape.
> >
> >     Of the RRX parsers, StAX/EV is the slowest. It was written first
> >     and it is faster than ARP but not as much as I'd hoped.
> >
> >     StAX/SR is a fairly simple conversion of the event code to use
> >     StreamReader and is faster than EV.
> >
> >     SAX is fastest sometimes, and sometimes it's StAX/SR.
> >
> >     TBH Once it was much faster than ARP, I didn't push on the
> >     performance as much.
> >
> >
> >     For testing I was using RDF from UniProt - citations.rdf has many
> >     long strings (literals of several lines of text). 3.6G
> >     uncompressed. 30e6 triples.
> >
> >     https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/citations.rdf.xz
> >
> >     Figures as of yesterday:
> >     Files on an SSD.
> >     Running off the development tree (getting class files from jars
> >     makes a measurable difference).
> >
> >     riot --syntax arp1 --time --count --sink citations.rdf
> >     citations.rdf   : Triples = 30,358,427
> >     citations.rdf   : 61.17 sec : 30,358,427 Triples : 496,287.90 per second
> >
> >     riot --syntax rrxstaxev --time --count --sink citations.rdf
> >     citations.rdf   : Triples = 30,358,427
> >     citations.rdf   : 52.22 sec : 30,358,427 Triples : 581,411.99 per second
> >
> >     riot --syntax rrxstaxsr --time --count --sink citations.rdf
> >     citations.rdf   : Triples = 30,358,427
> >     citations.rdf   : 31.86 sec : 30,358,427 Triples : 952,929.47 per second
> >
> >     riot --syntax rrxsax --time --count --sink citations.rdf
> >     citations.rdf   : Triples = 30,358,427
> >     citations.rdf   : 32.32 sec : 30,358,427 Triples : 939,220.59 per second
> >
> >     > The first thing I tried was to integrate other parsers than Xerces.
> >     > But when I added the dependencies, the unchanged riot/lang/rdfxml
> >     > implementations also started to use either woodstox or aalto.
> >     > The reason was in org.apache.jena.riot.lang.rdfxml.SysRRX, where
> >     > "XMLInputFactory.newInstance();" is used. To continue my
> >     > comparisons, I used "XMLInputFactory.newDefaultFactory()" instead.
> >     > --> So, when I bundle Jena with another application where aalto is
> >     > a dependency, aalto may be used?
> >
> >     As I understand it, "XMLInputFactory.newInstance()" returned an
> >     instance
> >     from the system configured parser and
> >     "XMLInputFactory.newDefaultFactory()" is always the built-in
> >     XMLInputFactoryImpl.
> >
> >     Parsers are configured to be safe via setup in JenaXMLInput
> >
> >     Yes, other parsers can be used.
> >
> >     Should Jena always use the built-in one?
> >     SysRRX could have a function that passes in the factory generator to
> >     allow an independent choice.
> >
> >     That would allow Jena to provide RDF/XML with one parser yet allow
> >     the app to have XML processing elsewhere with a different choice of
> >     XML parser.
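
A minimal sketch of that idea, assuming the caller passes in a supplier for
the factory. `XmlFactoryChoice` is a hypothetical name, and the two
properties set here are only standard StAX hardening switches chosen for
illustration; this is not the actual setup done in JenaXMLInput.

```java
import java.util.function.Supplier;
import javax.xml.stream.XMLInputFactory;

/** Hypothetical helper: the caller decides which StAX factory is used. */
final class XmlFactoryChoice {
    private XmlFactoryChoice() {}

    static XMLInputFactory create(Supplier<XMLInputFactory> generator) {
        XMLInputFactory factory = generator.get();
        // Illustrative hardening with standard StAX properties only.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        return factory;
    }

    public static void main(String[] args) {
        // Always the JDK built-in implementation (available since Java 9):
        XMLInputFactory builtIn = create(XMLInputFactory::newDefaultFactory);
        // Whatever the lookup mechanism finds, e.g. woodstox or aalto if on the classpath:
        XMLInputFactory discovered = create(XMLInputFactory::newInstance);
        System.out.println(builtIn.getClass().getName());
        System.out.println(discovered.getClass().getName());
    }
}
```

This way the RDF/XML parsers can pin one implementation while the rest of
the application keeps whatever `newInstance()` resolves to.
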
> >
> >     > If this is potentially dangerous, as mentioned in
> >     > https://issues.apache.org/jira/browse/JENA-2331, shouldn't we use
> >     > "XMLInputFactory.newDefaultFactory()" to ensure that no unknown
> >     > 3rd-party parser is used?
> >
> >     JENA-2331 is fixed, I hope.
> >     The setup should work with woodstox.
> >     (the code JenaXMLInput.initXMLInputFactory pokes around and should
> >     not set woodstox unsupported properties)
> >
> >     Is there still a problem?
> >
> >     > In my benchmarks, woodstox is a bit faster than the default, but
> >     > with larger files aalto is even faster. Unfortunately aalto seems
> >     > to be almost inactive.
> >
> >     What speeds were you getting?
> >
> >     > Do you have any suggestions on fast XML parsers or is there any
> >     > work on a faster RDF/XML parser?
> >     >
> >     >    Arne
> >     >
> >
>
