afs commented on issue #2240: URL: https://github.com/apache/jena/issues/2240#issuecomment-1925865098
> What do you think?

Some initial thoughts - I haven't dived deep into the new code yet.

Improvement in results formats is the most interesting part here. The results case is interesting in that it is, typically, create-once, read-once, so writing costs matter.

As you note, gzip gives a lot of benefit for the N-Triples-like formats (x8-x10 compression). And for publishing data, sticking to standards-based formats matters.

**RAM space**

All language parsers go through a String-to-IRI cache (`ParserProfileStd` - it's a per-parser-run cache). This reduces memory costs - it saves about 1/3 of the RAM needed to store a large in-memory dataset. SPARQL result sets do not.

We could have a JVM-wide term cache (it needs to be multithreading safe) - shared across parser runs and with receiving results. A rough sketch of the idea is at the end of this comment.

**Results**

So I think the improvement metric should be "faster", and whether size is a factor needs to be determined. It probably is across the web. The size impact is made complicated because HTTP speeds over the public internet, over average corporate networks, and in AWS/Azure/GCP are very different. The latter can be incredibly fast.

It'll be good to have a really good, non-standards results format.

**BSBM data**

> bsbm-1m.nt.gz

My copy of `bsbm-1m` in RDF-THRIFT is 254M uncompressed and 28M compressed.

On my machine, for RDF-THRIFT: the disk is an NVMe SSD, although the test data is probably in the OS disk cache. I don't believe that I/O is a significant factor, although it is a factor. The bandwidth on SSDs is really quite good.

```
bsbm-25m.rt    : 6.3G
bsbm-25m.rt.gz : 684M
```

```
$ riot --time --sink bsbm-25m.rt
bsbm-25m.rt : 18.86 sec : 24,997,044 Triples : 1,325,329.73 per second

$ riot --time --sink bsbm-25m.rt.gz
bsbm-25m.rt.gz : 24.02 sec : 24,997,044 Triples : 1,040,719.60 per second
```

x8-x10 is what I expect when compressing N-Triples. BSBM may be different in that it has a significant amount of longer, randomly generated strings (IIRC).

At those speeds, parsing isn't a factor in data loading into TDB2. The TDB2 parallel loader does the parsing on a separate thread, passes triples between threads in chunks of 100k, and has a queue of 10 blocks. It becomes a matter of waiting for the loader to consume the triples. A sketch of that pipeline shape follows below.
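To make the pipeline shape concrete, here is a minimal sketch of the chunked producer/consumer pattern. The chunk size and queue depth match the numbers above, but the class and method names are illustrative - this is not the actual TDB2 loader code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.jena.graph.Triple;

/** Illustrative chunked producer/consumer pipeline (not the TDB2 loader internals). */
public class ChunkedPipeline {
    static final int CHUNK_SIZE = 100_000;              // triples per chunk
    static final List<Triple> END = new ArrayList<>();  // end-of-stream sentinel

    // Queue of 10 chunks between the parser thread and the loader thread.
    private final BlockingQueue<List<Triple>> queue = new ArrayBlockingQueue<>(10);
    private List<Triple> current = new ArrayList<>(CHUNK_SIZE);

    // Parser thread: called for each parsed triple.
    void accept(Triple t) throws InterruptedException {
        current.add(t);
        if (current.size() == CHUNK_SIZE) {
            queue.put(current);                         // blocks if the loader falls behind
            current = new ArrayList<>(CHUNK_SIZE);
        }
    }

    // Parser thread: flush the partial chunk and signal end-of-stream.
    void finish() throws InterruptedException {
        if (!current.isEmpty())
            queue.put(current);
        queue.put(END);
    }

    // Loader thread: drain chunks and index them.
    void consume() throws InterruptedException {
        for (List<Triple> chunk = queue.take(); chunk != END; chunk = queue.take()) {
            // ... index the chunk into the store here ...
        }
    }
}
```

The `queue.put` back-pressure is what makes parsing "not a factor": once the queue is full, the parser simply waits for the loader.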
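For reference, the `riot --time --sink` runs above are roughly equivalent to this use of the RIOT API - a minimal sketch where the timing and output formatting are mine; the `RDFParser`/`StreamRDFLib` calls are standard Jena API, and gzip is handled automatically from the `.gz` file extension.

```java
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.lang.StreamRDFCounting;
import org.apache.jena.riot.system.StreamRDFLib;

public class ParseTiming {
    public static void main(String[] args) {
        String file = "bsbm-25m.rt.gz";                  // gzip detected by extension
        StreamRDFCounting sink = StreamRDFLib.count();   // counts and discards triples
        long start = System.nanoTime();
        RDFParser.source(file).lang(Lang.RDFTHRIFT).parse(sink);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%s : %.2f sec : %,d triples%n", file, seconds, sink.countTriples());
    }
}
```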
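And the RAM-space idea from earlier: a minimal sketch of what a JVM-wide term cache could look like. The class name and the choice of `ConcurrentHashMap` are assumptions on my part; `NodeFactory.createURI` is the real Jena call. A production version would need a bound and an eviction policy (an unbounded shared map is a leak), which is where the multithreading-safety concern really bites.

```java
import java.util.concurrent.ConcurrentHashMap;
import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;

/** Hypothetical JVM-wide IRI-to-Node cache (illustrative only). */
public final class GlobalTermCache {
    // Thread-safe map shared across parser runs and result-set readers.
    private static final ConcurrentHashMap<String, Node> IRIS = new ConcurrentHashMap<>();

    private GlobalTermCache() {}

    public static Node uri(String iriStr) {
        // computeIfAbsent yields one shared Node per IRI string, so
        // repeated occurrences don't allocate fresh Node objects.
        return IRIS.computeIfAbsent(iriStr, NodeFactory::createURI);
    }
}
```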
