rvesse commented on issue #2240: URL: https://github.com/apache/jena/issues/2240#issuecomment-1926737207
> The current THRIFT format is great, but in practice there are many duplicate strings that unnecessarily bloat the serialised data. Idea: Extend the format with a duplicate-free list of all occurring strings. And then only point to the index of the string in the remaining fields with an i32. For entire graphs, there would be exactly one list of strings. With the stream-based formats for RDF and results, there would be one such list per row, which would be built up by the consumer into a complete list, as each row only contains the strings that have been added. Apart from this change, the format would otherwise be identical.

You need to do this carefully such that you still allow stream processing of the data, because that's very common in real-world use cases. There is also a potential memory issue here in that a processor has to continuously build up the string table as it goes along (for both reading and writing). In the pathological case a malicious user could construct a SPARQL query, using a function like `UUID()`, that generates a massive stream of results where the strings never repeat, effectively creating a DoS attack against the server.

> The speed of serialisation and deserialisation in memory hardly differs between both THRIFT-based formats.

The in-memory measurement worries me here. The reality is that these serialisations are typically exchanged via either a file system or a network, where IO overhead can often dominate. If we're going to do benchmarks we need to account for that somehow.

> With streaming, classic compression is of little use because it can only compress the character strings that occur in the current row. The previously transmitted character strings are lost for compression. --> This problem is probably ideally addressed by the proposed new format. --> This way, the proposed new format could lead to a massive reduction in network traffic when transporting query results.

General purpose stream compression, e.g. Gzip, is usually fine provided you choose an appropriate buffer size for the Gzip stream. This avoids the problem you imply here. Compression schemes (e.g. Deflate) in formats like Gzip typically have a maximum-size window for references to previously used strings, so as long as your buffer size is appropriate you get good enough compression. In the pathological case with all-unique strings any compression scheme will struggle, but general-purpose ones generally cope best in such scenarios (potentially even falling back to uncompressed stored data blocks where it makes sense).
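To make the discussion concrete, the proposed per-row string table could be sketched roughly as below. This is a hypothetical illustration in plain Java, not the actual Jena RDF Thrift code; the `DictEncoder`, `DictDecoder`, and `EncodedRow` names are made up for this sketch:

```java
import java.util.*;

// Hypothetical sketch of the proposed format (not actual Jena code):
// each row ships only the strings first seen in that row, and fields
// reference strings by i32 index into the accumulated table.
class EncodedRow {
    final List<String> newStrings; // strings first seen in this row
    final int[] refs;              // index references for the row's fields
    EncodedRow(List<String> newStrings, int[] refs) {
        this.newStrings = newStrings;
        this.refs = refs;
    }
}

class DictEncoder {
    // NOTE: grows without bound over the stream -- this is exactly the
    // memory/DoS concern raised above for never-repeating strings.
    private final Map<String, Integer> table = new HashMap<>();

    EncodedRow encode(List<String> row) {
        List<String> newStrings = new ArrayList<>();
        int[] refs = new int[row.size()];
        for (int i = 0; i < row.size(); i++) {
            String s = row.get(i);
            Integer idx = table.get(s);
            if (idx == null) {
                idx = table.size();      // next free index
                table.put(s, idx);
                newStrings.add(s);       // transmitted once, in this row only
            }
            refs[i] = idx;
        }
        return new EncodedRow(newStrings, refs);
    }
}

class DictDecoder {
    private final List<String> table = new ArrayList<>();

    List<String> decode(EncodedRow row) {
        table.addAll(row.newStrings);    // consumer builds up the full table
        List<String> out = new ArrayList<>();
        for (int ref : row.refs) out.add(table.get(ref));
        return out;
    }
}
```

Note that both sides process one row at a time, so streaming still works; the cost is the unbounded `table` that both reader and writer must retain.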
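For comparison, the general-purpose compression point can be illustrated with the standard `java.util.zip` classes. Deflate maintains a 32 KB sliding window, so strings repeated across nearby rows compress well with no format change at all. A minimal sketch (the `GzipRows` class and its method are made-up names for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Sketch: stream-compressing a row-oriented serialisation with Gzip.
// Repeated strings across rows are caught by Deflate's sliding window,
// so duplicates cost little on the wire even in a streaming format.
class GzipRows {
    static byte[] compress(Iterable<String> rows) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // 8 KB internal buffer for the Gzip stream; tune to your IO layer
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes, 8192)) {
            for (String row : rows) {
                gzip.write(row.getBytes(StandardCharsets.UTF_8));
                gzip.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

With highly repetitive rows (the common case for RDF terms in query results) the compressed size ends up a small fraction of the raw bytes, which is the effect the quoted proposal attributes to the new format.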
