[GitHub] [jena] LorenzBuehmann commented on issue #1633: optional streaming construct?

GitBox Mon, 28 Nov 2022 00:16:28 -0800


LorenzBuehmann commented on issue #1633:
URL: https://github.com/apache/jena/issues/1633#issuecomment-1328697603


   Thanks for the advice. We ran into this need when trying to export a large subset of the loaded data.
   Some facts:
   
   - Dataset: `257 288 501` triples loaded into TDB2, using 52GB of disk space
   - Size of subset: `196 423 885` triples, resulting in a 26GB N-Triples file
   - Tool: `tdb2.tdbquery`
   
   With a 32GB heap we got an OOM after 22min:
   ```
   JVM_ARGS="-Xmx32G" tdb2.tdbquery --loc tdb2/siren --query subset.rq --results=N-Triples > subset.nt
   Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
           at org.apache.jena.mem.HashedBunchMap.newKeyArray(HashedBunchMap.java:39)
           at org.apache.jena.mem.HashedBunchMap.grow(HashedBunchMap.java:99)
           at org.apache.jena.mem.HashedBunchMap.put$(HashedBunchMap.java:90)
           at org.apache.jena.mem.HashedBunchMap.put(HashedBunchMap.java:70)
           at org.apache.jena.mem.NodeToTriplesMapMem.add(NodeToTriplesMapMem.java:51)
           at org.apache.jena.mem.GraphTripleStoreBase.add(GraphTripleStoreBase.java:60)
           at org.apache.jena.mem.GraphMem.performAdd(GraphMem.java:42)
           at org.apache.jena.graph.impl.GraphBase.add(GraphBase.java:169)
           at org.apache.jena.sparql.graph.GraphOps.addAll(GraphOps.java:75)
           at org.apache.jena.sparql.exec.QueryExecDataset.construct(QueryExecDataset.java:187)
           at org.apache.jena.sparql.exec.QueryExec.construct(QueryExec.java:111)
           at org.apache.jena.sparql.exec.QueryExecutionAdapter.execConstruct(QueryExecutionAdapter.java:122)
           at org.apache.jena.sparql.exec.QueryExecutionCompat.execConstruct(QueryExecutionCompat.java:105)
           at org.apache.jena.sparql.util.QueryExecUtils.doConstructQuery(QueryExecUtils.java:197)
           at org.apache.jena.sparql.util.QueryExecUtils.executeQuery(QueryExecUtils.java:113)
           at arq.query.lambda$queryExec$0(query.java:237)
           at arq.query$$Lambda$188/0x00007fb183cfd168.run(Unknown Source)
           at org.apache.jena.system.Txn.exec(Txn.java:77)
           at org.apache.jena.system.Txn.executeRead(Txn.java:115)
           at arq.query.queryExec(query.java:234)
           at arq.query.exec(query.java:157)
           at org.apache.jena.cmd.CmdMain.mainMethod(CmdMain.java:87)
           at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:56)
           at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:43)
           at tdb2.tdbquery.main(tdbquery.java:30)
   ```
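   With 64GB assigned it completed in 26min. The trace shows where the memory goes: `QueryExecDataset.construct` adds every result triple to an in-memory `GraphMem` before anything is written out.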
   
   Taking the advice from Andy into account, I combined `SELECT REDUCED` with TARQL:
   ```
   tdb2.tdbquery --loc tdb2/siren --query subset_select.rq --results=CSV |
     ../ukch/tarql-1.2/bin/tarql --ntriples --stdin subset_template.tarql subset.csv \
     > tarql_dump.nt
   ```
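
A mapping file for a pipeline like this has to turn the CSV strings back into RDF terms, because the SPARQL CSV result format carries no term types. The following is only a sketch of what such a file might look like, not the actual `subset_template.tarql`: the file name `example_template.tarql`, the column names `s`, `p`, `o` and the `*_iri` variables are all assumptions, and it further assumes that every object is an IRI (literals would need separate handling).

```shell
# Write a hypothetical TARQL mapping file; file name and variable names
# are illustrative assumptions, not taken from the original setup.
cat > example_template.tarql <<'EOF'
CONSTRUCT { ?s_iri ?p_iri ?o_iri }
WHERE {
  # CSV cells arrive as plain strings, so each IRI is rebuilt explicitly
  BIND (IRI(?s) AS ?s_iri)
  BIND (IRI(?p) AS ?p_iri)
  BIND (IRI(?o) AS ?o_iri)
}
EOF
```

The `CONSTRUCT` template keeps its shape from the original query; only the variables are swapped for the rebuilt ones.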
   That works without increasing the memory and produces a 31GB N-Triples file containing `235 632 534` triples. That is about 39 million more than the `196 423 885` triples of the `CONSTRUCT` result, i.e. there are lots of duplicates. For TARQL you can basically reuse the `CONSTRUCT` template, but you have to keep in mind to recreate the IRIs and bind them to new variables. But it works, and it would be the only option on my laptop, for example.
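
If the duplicates matter, one possible post-processing step is to sort the dump and keep distinct lines. This is a sketch under assumptions, not something that was run on this data: it assumes one triple per line and that equal triples are serialised byte-for-byte identically, and the helper name `dedup_ntriples` is made up. GNU `sort` spills to temporary files on disk for large inputs, so it should not need a larger heap either.

```shell
# Hypothetical helper: print the distinct lines of an N-Triples file.
# Assumes one triple per line and identical serialisation of equal triples.
dedup_ntriples() {
    # LC_ALL=C selects plain bytewise comparison, which is faster and
    # independent of the current locale.
    LC_ALL=C sort -u "$1"
}

# Example usage (the output file name is an assumption):
#   dedup_ntriples tarql_dump.nt > subset_dedup.nt
#   wc -l < subset_dedup.nt    # number of distinct lines
```

Blank nodes would break the assumption, since the same node can receive different labels on each run.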


