afs commented on issue #2240: URL: https://github.com/apache/jena/issues/2240#issuecomment-1925865098
> What do you think?

Some initial thoughts - I haven't dived deep into the new code yet.

Improvement in results formats is the most interesting part here. The results case is interesting in that it is, typically, create-once, read-once, so writing costs matter.

As you note, gzip gives a lot of benefit for the N-Triples-like formats (x8-x10 compression). And for publishing data, sticking to standards-based formats matters.

**RAM space**

All language parsers go through a String-to-IRI cache (`ParserProfileStd` - it's a per-parser-run cache). This reduces memory costs - it saves about 1/3 of the RAM needed to store a large in-memory dataset. SPARQL result sets do not.

We could have a JVM-wide term cache (it needs to be multithreading safe) - shared across parser runs and with receiving results. A rough sketch of the idea is at the end of this comment.

**Results**

So I think the improvement metric should be "faster", and whether size is a factor needs to be determined. It probably is across the web. The size impact is made complicated because HTTP speeds over the public internet, over average corporate networks, and in AWS/Azure/GCP are very different. The latter can be incredibly fast.

It'll be good to have a really good, non-standards results format.

**BSBM data**

> bsbm-1m.nt.gz

My copy of `bsbm-1m` in RDF-THRIFT is 254M uncompressed and 28M compressed.

On my machine, for RDF-THRIFT: the disk is an NVMe SSD, although the test data is probably in the OS disk cache. I don't believe that I/O is a significant factor, although it is a factor. The bandwidth on SSDs is really quite good.

```
bsbm-25m.rt    : 6.3G
bsbm-25m.rt.gz : 684M
```

```
$ riot --time --sink bsbm-25m.rt
bsbm-25m.rt : 18.86 sec : 24,997,044 Triples : 1,325,329.73 per second

$ riot --time --sink bsbm-25m.rt.gz
bsbm-25m.rt.gz : 24.02 sec : 24,997,044 Triples : 1,040,719.60 per second
```

x8-x10 is what I expect when compressing N-Triples. BSBM may be different in that it has a significant amount of longer, randomly generated strings (IIRC).

At those speeds, parsing isn't a factor in data loading into TDB2. The TDB2 parallel loader does the parsing on a separate thread, passes triples between threads in chunks of 100k, and has a queue of 10 blocks. It becomes a matter of waiting for the loader to consume the triples. A sketch of that pipeline shape follows below.
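To make the pipeline shape concrete, here is a minimal sketch of the chunked producer/consumer pattern. The chunk size and queue depth match the numbers above, but the class and method names are illustrative - this is not the actual TDB2 loader code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.jena.graph.Triple;

/** Illustrative chunked producer/consumer pipeline (not the TDB2 loader internals). */
public class ChunkedPipeline {
    static final int CHUNK_SIZE = 100_000;              // triples per chunk
    static final List<Triple> END = new ArrayList<>();  // end-of-stream sentinel

    // Queue of 10 chunks between the parser thread and the loader thread.
    private final BlockingQueue<List<Triple>> queue = new ArrayBlockingQueue<>(10);
    private List<Triple> current = new ArrayList<>(CHUNK_SIZE);

    // Parser thread: called for each parsed triple.
    void accept(Triple t) throws InterruptedException {
        current.add(t);
        if (current.size() == CHUNK_SIZE) {
            queue.put(current);                         // blocks if the loader falls behind
            current = new ArrayList<>(CHUNK_SIZE);
        }
    }

    // Parser thread: flush the partial chunk and signal end-of-stream.
    void finish() throws InterruptedException {
        if (!current.isEmpty())
            queue.put(current);
        queue.put(END);
    }

    // Loader thread: drain chunks and index them.
    void consume() throws InterruptedException {
        for (List<Triple> chunk = queue.take(); chunk != END; chunk = queue.take()) {
            // ... index the chunk into the store here ...
        }
    }
}
```

The `queue.put` back-pressure is what makes parsing "not a factor": once the queue is full, the parser simply waits for the loader.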
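For reference, the `riot --time --sink` runs above are roughly equivalent to this use of the RIOT API - a minimal sketch where the timing and output formatting are mine; the `RDFParser`/`StreamRDFLib` calls are standard Jena API, and gzip is handled automatically from the `.gz` file extension.

```java
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.lang.StreamRDFCounting;
import org.apache.jena.riot.system.StreamRDFLib;

public class ParseTiming {
    public static void main(String[] args) {
        String file = "bsbm-25m.rt.gz";                  // gzip detected by extension
        StreamRDFCounting sink = StreamRDFLib.count();   // counts and discards triples
        long start = System.nanoTime();
        RDFParser.source(file).lang(Lang.RDFTHRIFT).parse(sink);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%s : %.2f sec : %,d triples%n", file, seconds, sink.countTriples());
    }
}
```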
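And the RAM-space idea from earlier: a minimal sketch of what a JVM-wide term cache could look like. The class name and the choice of `ConcurrentHashMap` are assumptions on my part; `NodeFactory.createURI` is the real Jena call. A production version would need a bound and an eviction policy (an unbounded shared map is a leak), which is where the multithreading-safety concern really bites.

```java
import java.util.concurrent.ConcurrentHashMap;
import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;

/** Hypothetical JVM-wide IRI-to-Node cache (illustrative only). */
public final class GlobalTermCache {
    // Thread-safe map shared across parser runs and result-set readers.
    private static final ConcurrentHashMap<String, Node> IRIS = new ConcurrentHashMap<>();

    private GlobalTermCache() {}

    public static Node uri(String iriStr) {
        // computeIfAbsent yields one shared Node per IRI string, so
        // repeated occurrences don't allocate fresh Node objects.
        return IRIS.computeIfAbsent(iriStr, NodeFactory::createURI);
    }
}
```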
