[I] RDFParserBuilder.toDatasetGraph() is much slower than calling RDFParserBuilder.parse(DatasetGraph) [jena]

via GitHub Mon, 08 Apr 2024 07:44:01 -0700


rvesse opened a new issue, #2400:
URL: https://github.com/apache/jena/issues/2400


   ### Version
   
   5.0.0
   
   ### What happened?
   
   I was refactoring some code that uses `RDFParserBuilder` and noticed it was 
running much slower than before. Comparing the old and new code, the 
difference was that I'd started using `toDatasetGraph()` instead of creating my 
own `DatasetGraph` and then calling `parse()`.
   
   For example consider the following test cases:
   
   ```java
       // toDatasetGraph() creates the dataset itself; no explicit transaction
       @Test
       public void parse_to_dataset_01() {
           RDFParserBuilder builder =
               RDFParserBuilder.create().lang(Lang.NQUADS).fromString(testLarge);
           DatasetGraph dsg = builder.toDatasetGraph();
           assertEquals(LARGE_SIZE, dsg.stream().count());
       }
   
       // Caller-created dataset passed to parse(); no explicit transaction
       @Test
       public void parse_to_dataset_02() {
           RDFParserBuilder builder =
               RDFParserBuilder.create().lang(Lang.NQUADS).fromString(testLarge);
           DatasetGraph dsg = DatasetGraphFactory.create();
           builder.parse(dsg);
           assertEquals(LARGE_SIZE, dsg.stream().count());
       }
   
       // Transactional in-memory dataset, parsed inside one write transaction
       @Test
       public void parse_to_dataset_03() {
           RDFParserBuilder builder =
               RDFParserBuilder.create().lang(Lang.NQUADS).fromString(testLarge);
           DatasetGraph dsg = DatasetGraphFactory.createTxnMem();
           dsg.executeWrite(() -> builder.parse(dsg));
           assertEquals(LARGE_SIZE, dsg.stream().count());
       }
   ```
   
   Here `testLarge` is simply a generated string containing enough quads to 
illustrate the performance difference (10,000 quads is a good minimum; 100k 
shows a more obvious difference).
   
   With 100k quads, the first test takes ~1s, the 2nd ~200ms and the 3rd ~500ms.
   
   ![Screenshot 2024-04-08 at 15 37 51](https://github.com/apache/jena/assets/2104864/91ba1ffa-d872-4b9e-9485-69152d75b4e2)
   
   As the tests demonstrate, the difference seems to come down to the use of 
transactions.  Calling `toDatasetGraph()` creates a transactional dataset but 
never starts a transaction on it, so every write from the parser is treated as 
an individual commit, AFAICT.  Adding an explicit transaction, as in the 3rd 
test, yields a 2x speed-up over the 1st test.
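
To make the suspected mechanism concrete, here is a toy model in plain Java. It is not Jena code: `ToyStore`, `loadAutoCommit` and `loadInOneTransaction` are hypothetical names invented for this sketch, and it assumes (as guessed above) that a write made outside a transaction is wrapped in its own begin/commit. It only counts commits; it does not measure time.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only, not Jena: counts commits for autocommit vs one transaction. */
final class AutoCommitSketch {

    /** Hypothetical minimal store with explicit and implicit transactions. */
    static final class ToyStore {
        private final List<String> committed = new ArrayList<>();
        private List<String> pending = null; // non-null while a transaction is open
        private int commitCount = 0;

        void begin() {
            if (pending != null) throw new IllegalStateException("already in a transaction");
            pending = new ArrayList<>();
        }

        void commit() {
            if (pending == null) throw new IllegalStateException("no transaction");
            committed.addAll(pending);
            pending = null;
            commitCount++;
        }

        /** Adds one item; outside a transaction the add gets its own begin/commit. */
        void add(String item) {
            if (pending != null) {
                pending.add(item);
                return;
            }
            begin();
            pending.add(item);
            commit();
        }

        int size() { return committed.size(); }
        int commitCount() { return commitCount; }
    }

    /** Loads n items with no enclosing transaction: one commit per item. */
    static ToyStore loadAutoCommit(int n) {
        ToyStore store = new ToyStore();
        for (int i = 0; i < n; i++) store.add("quad-" + i);
        return store;
    }

    /** Loads n items inside a single explicit transaction: one commit in total. */
    static ToyStore loadInOneTransaction(int n) {
        ToyStore store = new ToyStore();
        store.begin();
        for (int i = 0; i < n; i++) store.add("quad-" + i);
        store.commit();
        return store;
    }

    public static void main(String[] args) {
        int n = 100_000;
        ToyStore a = loadAutoCommit(n);
        ToyStore b = loadInOneTransaction(n);
        System.out.println("autocommit:      " + a.size() + " items, " + a.commitCount() + " commits");
        System.out.println("one transaction: " + b.size() + " items, " + b.commitCount() + " commits");
    }
}
```

In this model both loads end with the same contents, but the autocommit load pays the commit overhead once per item while the explicit transaction pays it once. Whether Jena's transactional dataset behaves the same way internally is the assumption being tested by the 3rd test case.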
   
   In my code the transactional dataset was actually unnecessary, so calling 
`parse()` with my desired `DatasetGraph` implementation was the solution.
   
   Possible solutions:
   
   1. Have Jena use the fresh transactional dataset in a write transaction 
inside `RDFParser.toDatasetGraph()` to reduce the performance hit
   2. Have `StreamRDFLib.dataset()` automatically start and stop a transaction 
in 
   3. Document the `toDatasetGraph()` behaviour and its potential pitfalls more 
clearly
   
   1 is my preferred solution as it offers improved parsing performance without 
impacting potential other uses of the more general `StreamRDFLib.dataset()` API.
   
   ### Relevant output and stacktrace
   
   _No response_
   
   ### Are you interested in making a pull request?
   
   Yes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

