Rob Vesse created JENA-813:
------------------------------
Summary: In-memory DatasetGraph is extremely slow to iterate over
when lots of named graphs are used
Key: JENA-813
URL: https://issues.apache.org/jira/browse/JENA-813
Project: Apache Jena
Issue Type: Bug
Components: ARQ
Affects Versions: Jena 2.12.1
Reporter: Rob Vesse
Discovered this accidentally in the course of debugging a test case which
seemed to running exceptionally slow. Basically the in-memory {{DatasetGraph}}
implementation will perform extremely poorly when trying to iterate all quads
it contains.
This problem manifests at relatively small scales and shows a marked difference
between small numbers of quads in a single graph versus the quads in individual
graphs. In my testing with only a few thousand quads placing them in a single
graph versus many individual graphs can see 5/6 orders of magnitude performance
difference.
I will attach code to reproduce and show my log output from running this on my
system (Mac OS X Mavericks - Quad Core - 8GB RAM):
{noformat}
0 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Testing with 1000
quads
31 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main
44 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph
implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 1000 quads in
1 graphs
64 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took 0.009692
seconds to iterate over 1000 quads in 1 graphs
64 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main
85 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph
implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 1000 quads in
1000 graphs
779 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took 0.693548
seconds to iterate over 1000 quads in 1000 graphs
779 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Testing with
2000 quads
779 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main
788 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph
implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 2000 quads in
1 graphs
788 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took 0.000047
seconds to iterate over 2000 quads in 1 graphs
788 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main
821 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph
implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 2000 quads in
2000 graphs
7735 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took 6.914044
seconds to iterate over 2000 quads in 2000 graphs
7735 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Testing with
3000 quads
7735 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main
7743 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph
implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 3000 quads in
1 graphs
7743 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took 0.000032
seconds to iterate over 3000 quads in 1 graphs
7743 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main
7767 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph
implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 3000 quads in
3000 graphs
36233 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took 28.465885
seconds to iterate over 3000 quads in 3000 graphs
36233 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Testing with
4000 quads
36233 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main
36234 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph
implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 4000 quads in
1 graphs
36235 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took 0.000033
seconds to iterate over 4000 quads in 1 graphs
36235 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main
36253 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph
implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 4000 quads in
4000 graphs
109646 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took
73.393265 seconds to iterate over 4000 quads in 4000 graphs
109646 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Testing with
5000 quads
109646 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main
109648 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph
implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 5000 quads in
1 graphs
109648 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took 0.000066
seconds to iterate over 5000 quads in 1 graphs
109648 [main] DEBUG com.hp.hpl.jena.shared.LockMRSW - Lock : main
109654 [main] INFO org.apache.jena.playground.SlowDatasetGraph - DatasetGraph
implementation com.hp.hpl.jena.sparql.core.DatasetGraphMaker has 5000 quads in
5000 graphs
249867 [main] INFO org.apache.jena.playground.SlowDatasetGraph - Took
140.212082 seconds to iterate over 5000 quads in 5000 graphs
{noformat}
>From debugging my best guess is this is down to how {{DatasetGraphCollection}}
>(and it's hierarchy) build the quads iterator by iterating over the graph
>names and then iterating over the individual graphs appending the iterators
>together. Something in the structuring means accessing and appending all
>those individual iterators is drastically slow compared to iterating a single
>graph.
A related bug is that given a sufficiently large number of graphs attempts to
iterate will fail with a {{StackOverflowError}}, the culprit in this case is
{{IteratorCons}}.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)