rvesse opened a new issue, #2400:
URL: https://github.com/apache/jena/issues/2400
### Version
5.0.0
### What happened?
I was refactoring some code that uses `RDFParserBuilder` and noticed it was
running much slower than before. Comparing the old and new code, the
difference was that I'd started using `toDatasetGraph()` instead of creating my
own `DatasetGraph` and then calling `parse()`.
For example, consider the following test cases:
```java
// Case 1: let the builder create the dataset (transactional, but no transaction is used)
@Test
public void parse_to_dataset_01() {
    RDFParserBuilder builder =
        RDFParserBuilder.create().lang(Lang.NQUADS).fromString(testLarge);
    DatasetGraph dsg = builder.toDatasetGraph();
    assertEquals(LARGE_SIZE, dsg.stream().count());
}

// Case 2: parse into a caller-supplied, non-transactional dataset
@Test
public void parse_to_dataset_02() {
    RDFParserBuilder builder =
        RDFParserBuilder.create().lang(Lang.NQUADS).fromString(testLarge);
    DatasetGraph dsg = DatasetGraphFactory.create();
    builder.parse(dsg);
    assertEquals(LARGE_SIZE, dsg.stream().count());
}

// Case 3: parse into a transactional dataset inside one explicit write transaction
@Test
public void parse_to_dataset_03() {
    RDFParserBuilder builder =
        RDFParserBuilder.create().lang(Lang.NQUADS).fromString(testLarge);
    DatasetGraph dsg = DatasetGraphFactory.createTxnMem();
    dsg.executeWrite(() -> builder.parse(dsg));
    assertEquals(LARGE_SIZE, dsg.stream().count());
}
```
Here `testLarge` is simply a generated string containing enough quads to
illustrate the performance difference (10,000 quads is a good minimum; 100k
shows a more obvious difference).
With 100k quads the first test takes ~1s, the second ~200ms and the third ~500ms.
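
The issue does not show how `testLarge` is built, so the following is only a sketch of one way to generate it. The class name `QuadGenerator`, the `http://example.org/` IRIs and the spread over ten graph names are all assumptions made for illustration, not the code used for the timings above. Every quad has a distinct subject, so the number of quads in the parsed dataset should equal the requested count.

```java
final class QuadGenerator {
    // Hypothetical base IRI; any absolute IRI would do for generated test data.
    private static final String BASE = "http://example.org/";

    /** Builds an N-Quads document with exactly {@code count} distinct quads, one per line. */
    static String generate(int count) {
        StringBuilder sb = new StringBuilder(Math.max(16, count * 96));
        for (int i = 0; i < count; i++) {
            sb.append('<').append(BASE).append('s').append(i).append("> ")      // subject
              .append('<').append(BASE).append("p> ")                           // predicate
              .append('"').append(i).append("\" ")                              // plain literal object
              .append('<').append(BASE).append('g').append(i % 10).append("> ") // graph name
              .append(".\n");
        }
        return sb.toString();
    }
}
```

With this sketch, the test fixtures could be set up as `LARGE_SIZE = 100_000` and `testLarge = QuadGenerator.generate(LARGE_SIZE)`.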

As the tests demonstrate, the difference seems to be down to the use of
transactions. Calling `toDatasetGraph()` creates a transactional dataset but
then doesn't use a transaction on it, meaning all the writes from parsing are
treated as individual commits AFAICT. Adding an explicit transaction, as in the
third test, yields a 2x speed-up over the first test.
In my code the use of a transactional dataset was actually unnecessary, so
calling `parse()` with my desired `DatasetGraph` implementation was the
solution.
Possible solutions:
1. Have Jena use the fresh transactional dataset in a write transaction
inside `RDFParser.toDatasetGraph()` to reduce the performance hit
2. Have `StreamRDFLib.dataset()` automatically start and stop a transaction
in
3. Document the `toDatasetGraph()` behaviour and its potential pitfalls more
clearly
2 is my preferred solution as it offers improved parsing performance without
impacting potential other uses of the more general `StreamRDFLib.dataset()` API
### Relevant output and stacktrace
_No response_
### Are you interested in making a pull request?
Yes