rvesse commented on issue #2240: URL: https://github.com/apache/jena/issues/2240#issuecomment-1926737207
> The current THRIFT format is great, but in practice there are many duplicate strings that unnecessarily bloat the serialised data. Idea: Extend the format with a duplicate-free list of all occurring strings. And then only point to the index of the string in the remaining fields with an i32. For entire graphs, there would be exactly one list of strings. With the stream-based formats for RDF and results, there would be one such list per row, which would be built up by the consumer into a complete list, as each row only contains the strings that have been added. Apart from this change, the format would otherwise be identical.

You need to do this carefully such that you still allow stream processing of the data, because that's very common in real-world use cases. There is also a potential memory issue here in that a processor has to continuously build up the string table as it goes along (for both reading and writing). In the pathological case a malicious user could construct a SPARQL query, using a function like `UUID()`, that generates a massive stream of results where the strings never repeat, effectively creating a DoS attack against the server.

> The speed of serialisation and deserialisation in memory hardly differs between both THRIFT-based formats.

The in-memory measurement worries me here. The reality is that these serialisations are typically exchanged via either a file system or a network, where IO overhead can often dominate. If we're going to do benchmarks we need to account for that somehow.

> With streaming, classic compression is of little use because it can only compress the character strings that occur in the current row. The previously transmitted character strings are lost for compression. --> This problem is probably ideally addressed by the proposed new format. --> This way, the proposed new format could lead to a massive reduction in network traffic when transporting query results.

General purpose stream compression, e.g. Gzip, is usually fine provided you choose an appropriate buffer size for the Gzip stream. This avoids the problem you imply here. Compression schemes (e.g. Deflate) in formats like Gzip typically have a maximum-size window for references to previously used strings, so as long as your buffer size is appropriate you get good enough compression. In the pathological case with all-unique strings any compression scheme will struggle, but general-purpose ones generally cope best in such scenarios (potentially even falling back to uncompressed stored data blocks where it makes sense).
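To make the discussion concrete, the proposed per-row string table could be sketched roughly as below. This is a hypothetical illustration in plain Java, not the actual Jena RDF Thrift code; the `DictEncoder`, `DictDecoder`, and `EncodedRow` names are made up for this sketch:

```java
import java.util.*;

// Hypothetical sketch of the proposed format (not actual Jena code):
// each row ships only the strings first seen in that row, and fields
// reference strings by i32 index into the accumulated table.
class EncodedRow {
    final List<String> newStrings; // strings first seen in this row
    final int[] refs;              // index references for the row's fields
    EncodedRow(List<String> newStrings, int[] refs) {
        this.newStrings = newStrings;
        this.refs = refs;
    }
}

class DictEncoder {
    // NOTE: grows without bound over the stream -- this is exactly the
    // memory/DoS concern raised above for never-repeating strings.
    private final Map<String, Integer> table = new HashMap<>();

    EncodedRow encode(List<String> row) {
        List<String> newStrings = new ArrayList<>();
        int[] refs = new int[row.size()];
        for (int i = 0; i < row.size(); i++) {
            String s = row.get(i);
            Integer idx = table.get(s);
            if (idx == null) {
                idx = table.size();      // next free index
                table.put(s, idx);
                newStrings.add(s);       // transmitted once, in this row only
            }
            refs[i] = idx;
        }
        return new EncodedRow(newStrings, refs);
    }
}

class DictDecoder {
    private final List<String> table = new ArrayList<>();

    List<String> decode(EncodedRow row) {
        table.addAll(row.newStrings);    // consumer builds up the full table
        List<String> out = new ArrayList<>();
        for (int ref : row.refs) out.add(table.get(ref));
        return out;
    }
}
```

Note that both sides process one row at a time, so streaming still works; the cost is the unbounded `table` that both reader and writer must retain.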
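For comparison, the general-purpose compression point can be illustrated with the standard `java.util.zip` classes. Deflate maintains a 32 KB sliding window, so strings repeated across nearby rows compress well with no format change at all. A minimal sketch (the `GzipRows` class and its method are made-up names for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Sketch: stream-compressing a row-oriented serialisation with Gzip.
// Repeated strings across rows are caught by Deflate's sliding window,
// so duplicates cost little on the wire even in a streaming format.
class GzipRows {
    static byte[] compress(Iterable<String> rows) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // 8 KB internal buffer for the Gzip stream; tune to your IO layer
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes, 8192)) {
            for (String row : rows) {
                gzip.write(row.getBytes(StandardCharsets.UTF_8));
                gzip.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

With highly repetitive rows (the common case for RDF terms in query results) the compressed size ends up a small fraction of the raw bytes, which is the effect the quoted proposal attributes to the new format.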
