Ostrzyciel commented on issue #2240: URL: https://github.com/apache/jena/issues/2240#issuecomment-2631354887
I've only now found this thread... the discussion here touches on a lot of the points I encountered while working on [Jelly](https://w3id.org/jelly). I'll put my findings here in the hope that they will be more useful than annoying :)

- Better compression doesn't necessarily mean more CPU time has to be spent. In Jelly I found at some point that by, for example, de-duplicating the data to be serialized, you also save the CPU that would normally be spent encoding the duplicated data (allocating objects, encoding strings, transferring data in and out of memory). So by spending some time on compression, you save more time later. These kinds of optimizations were key to making Jelly what it is. Of course, this only works up to a point.
- Dictionary sizes (and memory usage in general) must be strictly bounded, while the data may be unbounded. The only solution to this is to design the compressor as a stream compressor that only looks at one triple/quad at a time and updates its dictionaries incrementally, evicting old entries (a rough sketch of the idea is included after this list). RDF4J Binary also does this, but it additionally looks ahead in the stream to better estimate frequency statistics. That hurts latency, so Jelly skips the look-ahead part.
- Wide support across many implementations is hugely important. Jelly supports Jena and RDF4J equally (the core is generic classes) via their respective plugin infrastructures. We are working on implementations for Python and Rust. I consider this the highest priority.
- Jelly doesn't currently support SPARQL results. I will be working on that in the coming months.
- Node caching in the parser helps performance a lot, especially if the format already indexes the nodes in some way. The Jelly parser eliminates most node duplicates on the heap with its size-limited node cache, allocated per parser (there is also a rough sketch of this after the list). A global node cache for Jena would help in multi-parser scenarios and could potentially boost in-memory performance a lot, but I currently have no idea how to implement that efficiently.
- It is possible to get better compression ratios with binary + gzip than with N-Triples + gzip. In a test on a diverse mix of datasets ([RiverBench 2.1.0](https://riverbench.github.io/v/2.1.0/datasets/)):
  - N-Triples (baseline): 100% size
  - N-Triples + gzip: 1.59%
  - Jena Thrift + gzip: 1.65%
  - Jelly: 1.04%, but this is beaten by Turtle + gzip at 0.97%.
  - This mostly comes down to gzip not handling binary data that well. With zstd, Jelly gets down to 0.87%, versus 0.89% for Turtle. Marginally better.
  - But yeah, in general the issue is that while you can do cool RDF-specific compression, the additional entropy from field tags and whatnot is detrimental to the ratio.
- I tested Jelly with end-to-end streaming on Kafka and it was the fastest option, regardless of whether compression was used or not. I don't have the nice results yet, as I'm in the middle of massaging them into something that the scientific committee would accept as a "worthy dissertation". But I admit that getting to this point is rather hard.
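To make the bounded-dictionary point above a bit more concrete, here is a minimal sketch of the general idea (hypothetical code, not Jelly's actual wire format or API; the class and method names are made up): the encoder assigns ids round-robin and evicts entries in insertion order, so a decoder can keep an identical, equally bounded table in sync without any look-ahead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a bounded encoder dictionary for a streaming RDF format.
// Ids are assigned round-robin over [0, capacity) and entries are evicted in
// insertion order, so the id being reused is always the one whose term was just
// evicted. A decoder applying the same rule stays in sync with no look-ahead.
final class BoundedTermDictionary {
    private final int capacity;
    private long insertions = 0;
    private final LinkedHashMap<String, Integer> ids;

    BoundedTermDictionary(int capacity) {
        this.capacity = capacity;
        // Insertion-ordered map; the eldest entry is dropped once the table is full.
        this.ids = new LinkedHashMap<>(capacity, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                return size() > BoundedTermDictionary.this.capacity;
            }
        };
    }

    /** Id of an already-seen term, or null if the caller must emit the full term once. */
    Integer lookup(String term) {
        return ids.get(term);
    }

    /** Registers a term not found by lookup(), evicting the oldest entry if needed. */
    int add(String term) {
        int id = (int) (insertions++ % capacity);
        ids.put(term, id);
        return id;
    }
}
```

The decoder mirrors the same table: whenever the stream carries a full term, it goes into the next round-robin slot, so memory stays bounded on both ends while only one triple/quad is ever held at a time.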
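And a rough sketch of the node-cache point (again hypothetical; this is neither Jelly's nor Jena's actual cache, and it only uses Jena's `NodeFactory` to show the idea): repeated IRIs in the input collapse to a single `Node` instance on the heap instead of being re-allocated for every occurrence.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;

// Hypothetical sketch of a size-limited, per-parser node cache: repeated IRIs
// are mapped to one Node instance instead of a fresh allocation per occurrence.
final class UriNodeCache {
    private final Map<String, Node> cache;

    UriNodeCache(int maxSize) {
        // Access-ordered LinkedHashMap used as a simple LRU: the least recently
        // used IRI is dropped once the cache exceeds maxSize entries.
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Node> eldest) {
                return size() > maxSize;
            }
        };
    }

    /** Returns a cached Node for this IRI, creating and caching it on a miss. */
    Node uri(String iri) {
        Node node = cache.get(iri);
        if (node == null) {
            node = NodeFactory.createURI(iri);
            cache.put(iri, node);
        }
        return node;
    }
}
```

A global, cross-parser variant would additionally need to be concurrent while staying strictly bounded, which is exactly the hard part mentioned above.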
I realize this is a closed thread, so if you'd like to continue the discussion, I guess you should probably contact me in private or open an issue in the Jelly repo. If anyone here also wants to collaborate on Jelly, I'm always open to that :)
