Ostrzyciel commented on issue #2240: URL: https://github.com/apache/jena/issues/2240#issuecomment-2631354887
I've only now found this thread... the discussion here touches on a lot of the points I encountered while working on [Jelly](https://w3id.org/jelly). I'll put my findings here in the hope that they will be more useful than annoying :)

- Better compression doesn't necessarily mean more CPU time has to be spent. In Jelly I found at some point that by, for example, de-duplicating the data to be serialized, you also save the CPU that would normally be spent encoding the duplicated data (allocating objects, encoding strings, transferring data in and out of memory). So by spending some time on compression, you save more time later. These kinds of optimizations were key to making Jelly what it is. Of course, this only works up to a point.
- Dictionary sizes (and memory usage in general) must be strictly bounded, while the data may be unbounded. The only solution to this is to design the compressor as a stream compressor that only looks at one triple/quad at a time and updates its dictionaries incrementally, evicting old entries (a rough sketch of the idea is included after this list). RDF4J Binary also does this, but it additionally looks ahead in the stream to better estimate frequency statistics. That hurts latency, so Jelly skips the look-ahead part.
- Wide support across many implementations is hugely important. Jelly supports Jena and RDF4J equally (the core is generic classes) via their respective plugin infrastructures. We are working on implementations for Python and Rust. I consider this the highest priority.
- Jelly doesn't currently support SPARQL results. I will be working on that in the coming months.
- Node caching in the parser helps performance a lot, especially if the format already indexes the nodes in some way. The Jelly parser eliminates most node duplicates on the heap with its size-limited node cache, allocated per parser (there is also a rough sketch of this after the list). A global node cache for Jena would help in multi-parser scenarios and could potentially boost in-memory performance a lot, but I currently have no idea how to implement that efficiently.
- It is possible to get better compression ratios with binary + gzip than with N-Triples + gzip. In a test on a diverse mix of datasets ([RiverBench 2.1.0](https://riverbench.github.io/v/2.1.0/datasets/)):
  - N-Triples (baseline): 100% size
  - N-Triples + gzip: 1.59%
  - Jena Thrift + gzip: 1.65%
  - Jelly: 1.04%, but this is beaten by Turtle + gzip at 0.97%.
  - This mostly comes down to gzip not handling binary data that well. With zstd, Jelly gets down to 0.87%, versus 0.89% for Turtle. Marginally better.
  - But yeah, in general the issue is that while you can do cool RDF-specific compression, the additional entropy from field tags and whatnot is detrimental to the ratio.
- I tested Jelly with end-to-end streaming on Kafka and it was the fastest option, regardless of whether compression was used or not. I don't have the nice results yet, as I'm in the middle of massaging them into something that the scientific committee would accept as a "worthy dissertation". But I admit that getting to this point is rather hard.
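To make the bounded-dictionary point above a bit more concrete, here is a minimal sketch of the general idea (hypothetical code, not Jelly's actual wire format or API; the class and method names are made up): the encoder assigns ids round-robin and evicts entries in insertion order, so a decoder can keep an identical, equally bounded table in sync without any look-ahead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a bounded encoder dictionary for a streaming RDF format.
// Ids are assigned round-robin over [0, capacity) and entries are evicted in
// insertion order, so the id being reused is always the one whose term was just
// evicted. A decoder applying the same rule stays in sync with no look-ahead.
final class BoundedTermDictionary {
    private final int capacity;
    private long insertions = 0;
    private final LinkedHashMap<String, Integer> ids;

    BoundedTermDictionary(int capacity) {
        this.capacity = capacity;
        // Insertion-ordered map; the eldest entry is dropped once the table is full.
        this.ids = new LinkedHashMap<>(capacity, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                return size() > BoundedTermDictionary.this.capacity;
            }
        };
    }

    /** Id of an already-seen term, or null if the caller must emit the full term once. */
    Integer lookup(String term) {
        return ids.get(term);
    }

    /** Registers a term not found by lookup(), evicting the oldest entry if needed. */
    int add(String term) {
        int id = (int) (insertions++ % capacity);
        ids.put(term, id);
        return id;
    }
}
```

The decoder mirrors the same table: whenever the stream carries a full term, it goes into the next round-robin slot, so memory stays bounded on both ends while only one triple/quad is ever held at a time.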
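And a rough sketch of the node-cache point (again hypothetical; this is neither Jelly's nor Jena's actual cache, and it only uses Jena's `NodeFactory` to show the idea): repeated IRIs in the input collapse to a single `Node` instance on the heap instead of being re-allocated for every occurrence.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;

// Hypothetical sketch of a size-limited, per-parser node cache: repeated IRIs
// are mapped to one Node instance instead of a fresh allocation per occurrence.
final class UriNodeCache {
    private final Map<String, Node> cache;

    UriNodeCache(int maxSize) {
        // Access-ordered LinkedHashMap used as a simple LRU: the least recently
        // used IRI is dropped once the cache exceeds maxSize entries.
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Node> eldest) {
                return size() > maxSize;
            }
        };
    }

    /** Returns a cached Node for this IRI, creating and caching it on a miss. */
    Node uri(String iri) {
        Node node = cache.get(iri);
        if (node == null) {
            node = NodeFactory.createURI(iri);
            cache.put(iri, node);
        }
        return node;
    }
}
```

A global, cross-parser variant would additionally need to be concurrent while staying strictly bounded, which is exactly the hard part mentioned above.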
I realize this is a closed thread, so if you'd like to continue the discussion, I guess you should probably contact me in private or open an issue in the Jelly repo. If anyone here also wants to collaborate on Jelly, I'm always open to that :)
