[jira] [Updated] (JENA-1848) Trig Writer slow; doesn't scale to many graphs

Claus Stadler (Jira) Fri, 21 Feb 2020 19:48:14 -0800


     [ 
https://issues.apache.org/jira/browse/JENA-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Claus Stadler updated JENA-1848:
--------------------------------
    Description: 
With the following code, loading 1,000,000 named graphs takes 1 minute on my notebook, 
but I aborted writing the data out as TriG after several hours.
{code:java}
Dataset ds = RDFDataMgr.loadDataset("test-data.trig");
RDFDataMgr.write(new NullOutputStream(), ds, RDFFormat.TRIG_PRETTY);
{code}

In comparison, writing takes 2 seconds for me with RDFFormat.NQUADS.

The test data I used can be generated with this *gendata.sh* bash script:
{code:bash}
#!/bin/bash
MAX=${1:-10}
echo "@prefix eg: <http://www.example.org/> ."
for i in `seq 1 $MAX`; do
  echo "<urn:graph-$i> { <urn:s-$i> eg:idx $i }"
done
{code}

Invoke the script with the number of named graphs to generate; in my case I used:
{code:bash}
./gendata.sh 1000000 > test-data.trig
{code}

With the profiler I could trace the problem to code in *TurtleShell.java* which 
repeatedly collects all one million graph names:

{code:java}
this.graphNames = (dsg != null) ? Iter.toSet(dsg.listGraphNodes()) : null ;
{code}

https://github.com/apache/jena/blob/2a13a9c633f1c8661c1a446a9d98819391c09477/jena-arq/src/main/java/org/apache/jena/riot/writer/TurtleShell.java#L185
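
To illustrate why this would scale so badly, here is a minimal sketch that does not use Jena at all. It assumes the set of graph names is rebuilt once for every graph that gets written (my reading of the profile, not verified against the rest of the writer). The class GraphNameCost and its methods are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GraphNameCost {

    // Hypothetical stand-in for dsg.listGraphNodes(): produces n graph names.
    static List<String> listGraphNames(int n) {
        List<String> names = new ArrayList<>(n);
        for (int i = 1; i <= n; i++) {
            names.add("urn:graph-" + i);
        }
        return names;
    }

    // Assumed problematic pattern: the full name set is rebuilt for each graph,
    // so n graphs cause n * n set insertions in total.
    static long insertionsWhenRebuiltPerGraph(List<String> names) {
        long insertions = 0;
        for (String ignored : names) {
            Set<String> all = new HashSet<>(names); // rebuilt on every iteration
            insertions += all.size();
        }
        return insertions;
    }

    // Alternative: build the name set once and reuse it for every graph,
    // so n graphs cause only n set insertions in total.
    static long insertionsWhenBuiltOnce(List<String> names) {
        Set<String> all = new HashSet<>(names); // built a single time
        long insertions = all.size();
        for (String name : names) {
            if (!all.contains(name)) { // cheap lookup per graph
                throw new IllegalStateException("missing graph name: " + name);
            }
        }
        return insertions;
    }

    public static void main(String[] args) {
        List<String> names = listGraphNames(2000);
        System.out.println("rebuilt per graph: " + insertionsWhenRebuiltPerGraph(names));
        System.out.println("built once:        " + insertionsWhenBuiltOnce(names));
    }
}
```

Under that assumption the work grows quadratically: 2000 graphs already need 4,000,000 insertions instead of 2000, and one million graphs would need on the order of 10^12, which would be consistent with the write not finishing within hours.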



  was:
The following code for loading 1.000.000 graphs takes 1 minute on my notebook, 
but I stopped my attempt of writing the data out as trig after several hours.
{code:java}
Dataset ds = RDFDataMgr.loadDataset("test-data.trig");
RDFDataMgr.write(new NullOutputStream(), ds, RDFFormat.TRIG_PRETTY);
{code}

In comparison, writing takes 2 seconds for me with RDFFormat.NQUADS.

The test data I used can be generated with this *gendata.sh* bash script:
{code:bash}
#!/bin/bash
MAX=${1:-10}
echo "@prefix eg: <http://www.example.org/> ."
for i in `seq 1 $MAX`; do
  echo "<urn:graph-$i> { <urn:s-$i> eg:idx $i }"
done
{code}

Invoke the script the number of named graphs to generate, in my case I used
{code:bash}
./gendata.sh 1000000 > test-data.trig`
{code}

With the profiler I could trace the problem to code in *TurtleShell.java* which 
repeatedly collects all one million graph names :

{code:java}
this.graphNames = (dsg != null) ? Iter.toSet(dsg.listGraphNodes()) : null ;`
{code}

https://github.com/apache/jena/blob/2a13a9c633f1c8661c1a446a9d98819391c09477/jena-arq/src/main/java/org/apache/jena/riot/writer/TurtleShell.java#L185




> Trig Writer slow; doesn't scale to many graphs
> ----------------------------------------------
>
>                 Key: JENA-1848
>                 URL: https://issues.apache.org/jira/browse/JENA-1848
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: RIOT
>    Affects Versions: Jena 3.14.0
>            Reporter: Claus Stadler
>            Priority: Major
>
> With the following code, loading 1,000,000 named graphs takes 1 minute on my 
> notebook, but I aborted writing the data out as TriG after several 
> hours.
> {code:java}
> Dataset ds = RDFDataMgr.loadDataset("test-data.trig");
> RDFDataMgr.write(new NullOutputStream(), ds, RDFFormat.TRIG_PRETTY);
> {code}
> In comparison, writing takes 2 seconds for me with RDFFormat.NQUADS.
> The test data I used can be generated with this *gendata.sh* bash script:
> {code:bash}
> #!/bin/bash
> MAX=${1:-10}
> echo "@prefix eg: <http://www.example.org/> ."
> for i in `seq 1 $MAX`; do
>   echo "<urn:graph-$i> { <urn:s-$i> eg:idx $i }"
> done
> {code}
> Invoke the script with the number of named graphs to generate; in my case I 
> used:
> {code:bash}
> ./gendata.sh 1000000 > test-data.trig
> {code}
> With the profiler I could trace the problem to code in *TurtleShell.java* 
> which repeatedly collects all one million graph names:
> {code:java}
> this.graphNames = (dsg != null) ? Iter.toSet(dsg.listGraphNodes()) : null ;
> {code}
> https://github.com/apache/jena/blob/2a13a9c633f1c8661c1a446a9d98819391c09477/jena-arq/src/main/java/org/apache/jena/riot/writer/TurtleShell.java#L185



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
