eltonfss opened a new issue, #1735:
URL: https://github.com/apache/jena/issues/1735

   ### Version
   
   4.4.0
   
   ### Question
   
   This question has also been published at:
   - Stack Overflow: 
https://stackoverflow.com/questions/75264889/why-does-the-ospg-dat-file-grows-so-much-more-than-all-other-files
   - Mailing List: 
https://lists.apache.org/thread/jxcfhkly7781k8hnw2qdy09fbj3xych8
   
   # Scenario Description (Context)
   
   I'm running Jena Fuseki Version 4.4.0 as a container on an OpenShift Cluster.
   ```
   OS Version Info (cat /etc/os-release):
   NAME="Red Hat Enterprise Linux"
   VERSION="8.5 (Ootpa)"
   ID="rhel"
   ID_LIKE="fedora"
   VERSION_ID="8.5"
   ...
   ```
   
   Hardware Info (from Jena Fuseki initialization log):
   ```
   [2023-01-27 20:08:59] Server INFO Memory: 32.0 GiB
   [2023-01-27 20:08:59] Server INFO Java: 11.0.14.1
   [2023-01-27 20:08:59] Server INFO OS: Linux 3.10.0-1160.76.1.el7.x86_64 amd64
   [2023-01-27 20:08:59] Server INFO PID: 1
   ```
   
   Disk Info (df -h):
   ```
   Filesystem Size Used Avail Use% Mounted on
   overlay 99G 76G 18G 82% /
   tmpfs 64M 0 64M 0% /dev
   tmpfs 63G 0 63G 0% /sys/fs/cgroup
   shm 64M 0 64M 0% /dev/shm
   /dev/mapper/docker_data 99G 76G 18G 82% /config
   /data 1.0T 677G 348G 67% /usr/app/run
   tmpfs 40G 24K 40G 1%
   ```
   
   My dataset is built using TDB2, and currently has the following RDF Stats:
   - Triples: ~65 million
   - Subjects: ~20 million
   - Objects: ~8 million
   - Graphs: ~213 thousand
   - Predicates: 153
   
   The on-disk files for this dataset alone add up to approximately 671GB (measured with `du -h`). The largest of these files are:
   ```
   · /usr/app/run/databases/my-dataset/Data-0001/OSPG.dat: 243GB
   · /usr/app/run/databases/my-dataset/Data-0001/nodes.dat: 76GB
   · /usr/app/run/databases/my-dataset/Data-0001/POSG.dat: 35GB
   · /usr/app/run/databases/my-dataset/Data-0001/nodes.idn: 33GB
   · /usr/app/run/databases/my-dataset/Data-0001/POSG.idn: 29GB
   · /usr/app/run/databases/my-dataset/Data-0001/OSPG.idn: 27GB
   ```
   
   # Main Questions
   - Q1: I've been using Jena for quite some time now and I'm well aware that its indexes grow significantly during usage, especially when triples are added across multiple requests (transactional workloads). What is the main reason for this? Are the indexes being replicated somehow?
   - Q2: I've looked into several documentation pages, source code, and forums, but nowhere was I able to find an explanation for why OSPG.dat is so much larger than all the other files. Is there a reasonable explanation for this based on the content of the dataset or the way it was generated?
   - Q3: Could this be an indexing bug within TDB2? Would it be solved by upgrading to Jena 4.7.0?
   
   # Appendix
   
   ## Assembler configuration for my dataset:
   ```
   @prefix : <http://base/#> .
   @prefix fuseki: <http://jena.apache.org/fuseki#> .
   @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
   @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
   @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
   @prefix root: <http://dev-test-jena-fuseki/$/datasets> .
   @prefix tdb2: <http://jena.apache.org/2016/tdb#> .
   
   tdb2:GraphTDB rdfs:subClassOf ja:Model .
   
   ja:ModelRDFS rdfs:subClassOf ja:Model .
   
   ja:RDFDatasetSink rdfs:subClassOf ja:RDFDataset .
   
   <http://jena.hpl.hp.com/2008/tdb#DatasetTDB>
   rdfs:subClassOf ja:RDFDataset .
   
   tdb2:GraphTDB2 rdfs:subClassOf ja:Model .
   
   <http://jena.apache.org/text#TextDataset>
   rdfs:subClassOf ja:RDFDataset .
   
   ja:RDFDatasetZero rdfs:subClassOf ja:RDFDataset .
   
   :service_tdb_my-dataset
   rdf:type fuseki:Service ;
   rdfs:label "TDB my-dataset" ;
   fuseki:dataset :ds_my-dataset ;
   fuseki:name "my-dataset" ;
   fuseki:serviceQuery "sparql" , "query" ;
   fuseki:serviceReadGraphStore "get" ;
   fuseki:serviceReadWriteGraphStore
   "data" ;
   fuseki:serviceUpdate "update" ;
   fuseki:serviceUpload "upload" .
   
   ja:ViewGraph rdfs:subClassOf ja:Model .
   
   ja:GraphRDFS rdfs:subClassOf ja:Model .
   
   tdb2:DatasetTDB rdfs:subClassOf ja:RDFDataset .
   
   <http://jena.hpl.hp.com/2008/tdb#GraphTDB>
   rdfs:subClassOf ja:Model .
   
   ja:DatasetTxnMem rdfs:subClassOf ja:RDFDataset .
   
   tdb2:DatasetTDB2 rdfs:subClassOf ja:RDFDataset .
   
   ja:RDFDatasetOne rdfs:subClassOf ja:RDFDataset .
   
   ja:MemoryDataset rdfs:subClassOf ja:RDFDataset .
   
   ja:DatasetRDFS rdfs:subClassOf ja:RDFDataset .
   
   :ds_my-dataset rdf:type tdb2:DatasetTDB2 ;
   tdb2:location "run/databases/my-dataset" ;
   tdb2:unionDefaultGraph true ;
   ja:context [ ja:cxtName "arq:optFilterPlacement" ;
   ja:cxtValue "false"
   ] .
   ```
   
   ## My Dataset Compression experiment
   
   After getting some feedback from the Jena community through the mailing list, I tried two compression strategies on this dataset to see which one would work best. The one I refer to as "official" uses the "/$/compact" endpoint; the one I refer to as "unofficial" creates an N-Quads backup and loads it into a new dataset with the TDB2 loader. I attempted the second strategy because a Stack Overflow post suggested it could be significantly more efficient than the "official" strategy (https://stackoverflow.com/questions/60501386/compacting-a-dataset-in-apache-jena-fuseki/60631699#60631699).
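   For reference, both strategies can be sketched as shell commands. The host, port, and paths below are illustrative placeholders (not taken from my actual deployment), and `my-dataset` stands in for the real dataset name:

   ```shell
   # Hypothetical endpoint and paths, for illustration only.
   FUSEKI=http://localhost:3030
   DB=/usr/app/run/databases/my-dataset

   # "Official" strategy: ask Fuseki to compact the dataset in place.
   # This writes a new Data-NNNN folder next to Data-0001.
   curl -X POST "$FUSEKI/\$/compact/my-dataset"

   # "Unofficial" strategy: dump to N-Quads, then bulk-load into a fresh
   # location with the TDB2 loader (run with the server stopped).
   tdb2.tdbdump --loc="$DB" > my-dataset.nq
   tdb2.tdbloader --loc="${DB}-replica" my-dataset.nq
   ```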
   
   Here is a summary of the results I obtained with both compression strategies:
   
   ### Original Dataset
   
   RDF Stats:
   - Triples: 65222513 (Approximately 65 million)
   - Subjects: 20434264 (Approximately 20 million)
   - Objects: 8565221 (Approximately 8 million)
   - Graphs: 213531 (Approximately 213 thousand)
   - Predicates: 153
   
   Disk Stats:
   - my-dataset/Data-0001: 671GB
   - my-dataset/Data-0001/OSPG.dat: 243GB
   - my-dataset/Data-0001/nodes.dat: 76GB
   - my-dataset/Data-0001/POSG.dat: 35GB
   - my-dataset/Data-0001/nodes.idn: 33GB
   - my-dataset/Data-0001/POSG.idn: 29GB
   - my-dataset/Data-0001/OSPG.idn: 27GB
   - ...
   
   ### Dataset Replica ("unofficial" compression strategy)
   
   Description: Backed up the dataset as N-Quads and restored it as a new dataset with the TDB2 loader.
   
   References:
   - https://jena.apache.org/documentation/tdb2/tdb2_admin.html#backup
   - https://jena.apache.org/documentation/tdb2/tdb2_cmds.html
   
   RDF Stats:
   - Triples: 65222513 (Approximately 65 million)
   - Subjects: 20434264 (Approximately 20 million)
   - Objects: 8565221 (Approximately 8 million)
   - Graphs: 213531 (Approximately 213 thousand)
   - Predicates: 153
   
   Disk Stats:
   - my-dataset-replica/Data-0001: 23GB
   - my-dataset-replica/Data-0001/OSPG.dat: 3.5GB
   - my-dataset-replica/Data-0001/nodes.dat: 680MB
   - my-dataset-replica/Data-0001/POSG.dat: 3.6GB
   - my-dataset-replica/Data-0001/nodes.idn: 8.0M
   - my-dataset-replica/Data-0001/POSG.idn: 32M
   - my-dataset-replica/Data-0001/OSPG.idn: 32M
   - ...
   
   ### Compressed Dataset ("official" compression strategy)
   
   Description: Compressed using the `/$/compact/` endpoint, which generates a new Data-NNNN folder within the same dataset.
   
   References:
   - https://jena.apache.org/documentation/tdb2/tdb2_admin.html#compaction
   
   RDF Stats:
   - Triples: 65222513 (Approximately 65 million)
   - Subjects: 20434264 (Approximately 20 million)
   - Objects: 8565221 (Approximately 8 million)
   - Graphs: 213531 (Approximately 213 thousand)
   - Predicates: 153
   
   Disk Stats:
   - my-dataset/Data-0002: 23GB
   - my-dataset/Data-0002/OSPG.dat: 3.7GB
   - my-dataset/Data-0002/nodes.dat: 680MB
   - my-dataset/Data-0002/POSG.dat: 3.8GB
   - my-dataset/Data-0002/nodes.idn: 8.0M
   - my-dataset/Data-0002/POSG.idn: 40M
   - my-dataset/Data-0002/OSPG.idn: 32M
   - ...
   
   ### Comparison
   
   RDF Stats:
   - Triples: Same Count
   - Subjects: Same Count
   - Objects: Same Count
   - Graphs: Same Count
   - Predicates: Same Count
   
   Disk Stats:
   - Total Space: ~29x reduction with both strategies
   - OSPG.dat: ~69x reduction with replication and ~65x reduction with compression
   - nodes.dat: ~111x reduction with both strategies
   - POSG.dat: ~9.7x reduction with replication and ~7.6x reduction with compression
   - nodes.idn: ~4125x reduction with both strategies
   - POSG.idn: ~906x reduction with replication and ~725x reduction with compression
   - OSPG.idn: ~843.75x reduction with both strategies
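   As a quick sanity check, the headline reduction factors can be recomputed from the sizes reported above (using 1000MB per GB, which matches how the figures appear to have been rounded; this is a back-of-the-envelope check, not exact byte counts):

   ```shell
   # Recompute the reduction factors from the sizes reported above
   # (replica column; GB and MB values as listed in the disk stats).
   awk 'BEGIN {
     printf "Total:     ~%.1fx\n", 671 / 23            # 671GB -> 23GB
     printf "OSPG.dat:  ~%.1fx\n", 243 / 3.5           # 243GB -> 3.5GB
     printf "nodes.dat: ~%.1fx\n", 76 * 1000 / 680     # 76GB  -> 680MB
     printf "nodes.idn: ~%.1fx\n", 33 * 1000 / 8       # 33GB  -> 8.0MB
     printf "OSPG.idn:  ~%.2fx\n", 27 * 1000 / 32      # 27GB  -> 32MB
   }'
   ```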
   
   ### Queries used to obtain the RDF Stats
   
   #### Triples
   ```
   SELECT (COUNT(*) as ?count)
   WHERE {
     GRAPH ?graph {
       ?subject ?predicate ?object
     }
   }
   ```
   
   #### Graphs
   ```
   SELECT (COUNT(DISTINCT ?graph) as ?count)
   WHERE {
     GRAPH ?graph {
       ?subject ?predicate ?object
     }
   }
   ```
   
   #### Subjects
   
   ```
   SELECT (COUNT(DISTINCT ?subject) as ?count)
   WHERE {
     GRAPH ?graph {
       ?subject ?predicate ?object
     }
   }
   ```
   
   #### Predicates
   ```
   SELECT (COUNT(DISTINCT ?predicate) as ?count)
   WHERE {
     GRAPH ?graph {
       ?subject ?predicate ?object
     }
   }
   ```
   
   #### Objects
   ```
   SELECT (COUNT(DISTINCT ?object) as ?count)
   WHERE {
     GRAPH ?graph {
       ?subject ?predicate ?object
     }
   }
   ```
   
   ## Commands used to measure the Disk Stats
   
   ### File Sizes
   ```
   ls -lh --sort=size
   ```
   
   ### Directory Sizes
   ```
   du -h
   ```
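   Alternatively, a single pipeline can list the largest entries under the database directory, biggest first (the path is illustrative; `.` here means the current directory):

   ```shell
   # Show the 10 largest files/directories under the current directory,
   # human-readable and sorted biggest first. Replace "." with the real
   # database location, e.g. /usr/app/run/databases/my-dataset.
   du -ah . | sort -rh | head -n 10
   ```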
   
   

