LorenzBuehmann commented on issue #1633:
URL: https://github.com/apache/jena/issues/1633#issuecomment-1328697603
Thanks for the advice. We stumbled upon this need when trying to export a larger
subset of the loaded data.
Some facts:
- Dataset: `257 288 501` triples loaded into TDB2, consuming 52GB of disk space
- Size of subset: `196 423 885` triples, resulting in 26GB of N-Triples
Using `tdb2.tdbquery` with a 32GB heap, we got an OOM after 22min:
```
JVM_ARGS="-Xmx32G" tdb2.tdbquery --loc tdb2/siren --query subset.rq --results=N-Triples > subset.nt
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.jena.mem.HashedBunchMap.newKeyArray(HashedBunchMap.java:39)
    at org.apache.jena.mem.HashedBunchMap.grow(HashedBunchMap.java:99)
    at org.apache.jena.mem.HashedBunchMap.put$(HashedBunchMap.java:90)
    at org.apache.jena.mem.HashedBunchMap.put(HashedBunchMap.java:70)
    at org.apache.jena.mem.NodeToTriplesMapMem.add(NodeToTriplesMapMem.java:51)
    at org.apache.jena.mem.GraphTripleStoreBase.add(GraphTripleStoreBase.java:60)
    at org.apache.jena.mem.GraphMem.performAdd(GraphMem.java:42)
    at org.apache.jena.graph.impl.GraphBase.add(GraphBase.java:169)
    at org.apache.jena.sparql.graph.GraphOps.addAll(GraphOps.java:75)
    at org.apache.jena.sparql.exec.QueryExecDataset.construct(QueryExecDataset.java:187)
    at org.apache.jena.sparql.exec.QueryExec.construct(QueryExec.java:111)
    at org.apache.jena.sparql.exec.QueryExecutionAdapter.execConstruct(QueryExecutionAdapter.java:122)
    at org.apache.jena.sparql.exec.QueryExecutionCompat.execConstruct(QueryExecutionCompat.java:105)
    at org.apache.jena.sparql.util.QueryExecUtils.doConstructQuery(QueryExecUtils.java:197)
    at org.apache.jena.sparql.util.QueryExecUtils.executeQuery(QueryExecUtils.java:113)
    at arq.query.lambda$queryExec$0(query.java:237)
    at arq.query$$Lambda$188/0x00007fb183cfd168.run(Unknown Source)
    at org.apache.jena.system.Txn.exec(Txn.java:77)
    at org.apache.jena.system.Txn.executeRead(Txn.java:115)
    at arq.query.queryExec(query.java:234)
    at arq.query.exec(query.java:157)
    at org.apache.jena.cmd.CmdMain.mainMethod(CmdMain.java:87)
    at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:56)
    at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:43)
    at tdb2.tdbquery.main(tdbquery.java:30)
```
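With 64GB assigned, it worked in 26min. The trace above shows the `CONSTRUCT`
result being collected into an in-memory graph (`GraphMem`) before it is
written out, which is why the heap size is the limiting factor.
Taking the advice from Andy into account, I combined `SELECT REDUCED` with
TARQL: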
```
tdb2.tdbquery --loc tdb2/siren --query subset_select.rq --results=CSV |
  ../ukch/tarql-1.2/bin/tarql --ntriples --stdin subset_template.tarql subset.csv > tarql_dump.nt
```
That works without increasing the memory and produces a 31GB N-Triples file
containing `235 632 534` triples, i.e. there are lots of duplicates (roughly
39 million more lines than the `196 423 885` triples of the `CONSTRUCT`
result). So, for TARQL you can basically reuse the `CONSTRUCT` template, but
you have to keep in mind to recreate the IRIs and bind them to new variables,
since the CSV results carry every value as a plain string. But it works, and
it would be the only option on my laptop, for example.
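If the duplicates are a problem, one possible follow-up step is to deduplicate
the N-Triples output line by line with `sort -u`, which sorts on disk and so
should not bring back the heap problem. This is only a sketch: `dedup_nt` is a
hypothetical helper name, the sample file is made up so the snippet runs on its
own, and it assumes that identical triples are serialized byte-for-byte
identically by a single TARQL run.

```shell
#!/bin/sh

# dedup_nt is a hypothetical helper name, not part of Jena or TARQL.
# It drops byte-identical lines, so it only removes duplicate triples
# that were serialized identically (assumed here for a single TARQL run).
dedup_nt() {
    # LC_ALL=C compares raw bytes: locale-independent and usually faster.
    # sort spills to temporary files, so it does not hold the whole input in memory.
    LC_ALL=C sort -u "$1" > "$2"
}

# Tiny stand-in for tarql_dump.nt so the sketch runs on its own.
tmp=$(mktemp -d)
printf '%s\n' \
    '<http://example.org/s1> <http://example.org/p> "a" .' \
    '<http://example.org/s2> <http://example.org/p> "b" .' \
    '<http://example.org/s1> <http://example.org/p> "a" .' \
    > "$tmp/sample.nt"

dedup_nt "$tmp/sample.nt" "$tmp/sample_dedup.nt"

# Compare line counts before and after.
wc -l < "$tmp/sample.nt"
wc -l < "$tmp/sample_dedup.nt"
```

On the real data this would be something like
`dedup_nt tarql_dump.nt subset_dedup.nt`, where `subset_dedup.nt` is an
illustrative name. The sort needs temporary disk space of roughly the input
size; GNU `sort` accepts `-T` to point its temporary files at another disk.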