rvesse commented on issue #2240: URL: https://github.com/apache/jena/issues/2240#issuecomment-1926709732
So I looked at this problem extensively around 10 years ago and developed something we called Binary Tuple Format (BTF). This was all proprietary $dayjob work and it never went beyond the research stage, for several reasons which I'll go into in depth in this response. So there's no code/reports I can share, but I can talk about it in general terms.

TL;DR Version:

- Using GZip compression with formats like NTriples/RDF-Thrift for graphs, and SPARQL-TSV for results, generally offered the best overall performance (see discussion for what that means)
- Specialised formats can offer some performance gains on specific metrics, e.g. data size, but this tends to not be that relevant when you look at overall performance

----

### Binary Tuple Format

Binary Tuple Format (BTF) was an effort to address the problem of how to communicate results faster between the server and the client. This was work at Cray, so we had some specific additional constraints that we were looking to take account of:

1. The server had vastly more memory than the client(s)
2. The server could parallelise the results output because we didn't send the results directly over the network. Rather the server wrote a file to the parallel file system and merely communicated the metadata (the filename, results count and size etc.) over the wire to the client.
3. The client was often resource constrained (particularly for memory) relative to the server, so any format had to allow for that

Design-wise the format consisted of a global header (magic bytes, metadata, data type - triples, quads or results - result column names if applicable, prefix mappings etc.), followed by one or more data blocks. Each data block consists of a block header (block type, block type specific header fields) followed by the actual data. This gave us a framework in which we could experiment extensively with different compression schemes and data layouts. Since an output file consisted of multiple blocks we could generate these in parallel and then write the blocks out in any sequence we wanted (except for ordered SPARQL results, where order needed preserving).

We implemented a number of different block types, from what we called Tabular Blocks (basically uncompressed NTriples/SPARQL TSV) through to [Compressed Dictionary Blocks](#compressed-dictionary-blocks) that used different encoding and compression techniques.

### Quantifying Overall Performance

How you quantify performance becomes a key part of the problem. It's easy to focus on a single metric, e.g. overall data size, but we found through extensive comparisons between BTF, all the standard formats supported by Jena, and [HDT](https://www.rdfhdt.org) that it wasn't the whole story.

Firstly, data is never just written, it always needs to be read at some point as well. So at a bare minimum you need to consider the whole round trip, i.e. both serialisation and parsing time.

Secondly, there is also a computational cost (CPU and memory) associated with each data format. Applying any kind of compression is trading CPU time plus memory consumption for output size. Depending on the system the overhead may actually outweigh the benefit.

Aside - not specific to BTF, but with other well known distributed compute frameworks we were running on our systems at the time, it was actually a benefit to disable their default network compression behaviour. Our network interconnect was so good it wasn't worth trading the CPU time for smaller data transfers; it was faster just to send the uncompressed data.
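To make the round trip point above concrete, here's a minimal sketch (not our benchmark harness, and `data.nt` is just a placeholder input) that times both halves of a GZipped NTriples round trip using Jena's standard `RDFDataMgr` plus `java.util.zip`; a real comparison would also need to track memory and warm up the JVM over repeated runs:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

public class RoundTripTiming {
    public static void main(String[] args) throws Exception {
        Model model = ModelFactory.createDefaultModel();
        RDFDataMgr.read(model, "data.nt");   // placeholder input file

        // Serialise: N-Triples wrapped in GZip
        long start = System.nanoTime();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            RDFDataMgr.write(gzip, model, Lang.NTRIPLES);
        }
        long writeNanos = System.nanoTime() - start;

        // Parse it back: the other half of the round trip
        start = System.nanoTime();
        Model parsed = ModelFactory.createDefaultModel();
        try (GZIPInputStream gzip = new GZIPInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            RDFDataMgr.read(parsed, gzip, Lang.NTRIPLES);
        }
        long readNanos = System.nanoTime() - start;

        System.out.printf("size=%d bytes, write=%d ms, read=%d ms%n",
                bytes.size(), writeNanos / 1_000_000, readNanos / 1_000_000);
    }
}
```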
And as @afs notes, networks have evolved hugely over the intervening years. That's not to say there aren't cases where the network is the bottleneck, even in cloud/HPC environments, but it isn't the only thing to consider these days.

Which is to say that any judgement of improved performance has to be holistic. It's not enough to just say that your new format consistently gives smaller file sizes if it does so at the cost of increased processing time and/or memory consumption. You **MUST** evaluate all those factors together somehow, e.g. weighted scoring, and I'm not sure we ever came up with a good metric for that.

### Standards Matter

The RDF/Semantic Web community already has a problem IMO with over-proliferation of standards. For any new format to be broadly useful it has to be adopted across the wider community. If it's only in Jena it only benefits users whose entire workflow/lifecycle can be encompassed in Jena. As soon as they go outside of Jena they lose the benefits. This might be fine for your specific use case, but if it doesn't benefit the majority of users are we actually adding sufficient value to justify it?

In most real-world use cases I've seen, Jena is rarely involved in 100% of the workflow. Typically Jena's involvement terminates at Fuseki as a SPARQL endpoint, and then other parts of the business use JavaScript, Python etc. to implement UIs, data science etc. on top of the SPARQL results.

### Compressed Dictionary Blocks

Going back to BTF, the data block type that ended up giving the best overall performance (caveats on that term as I described earlier) was what we called a Compressed Dictionary Block. This block type consisted of a fixed size dictionary that was compressed with Gzip, followed by a series of rows. Each row is a fixed number of bytes, with `N` bytes for each column, where the value of `N` depends on the size of the dictionary in use. The value for each column is the offset of the associated RDF term in the block's dictionary. We assigned some offsets special meanings, e.g. `0` means no value (particularly relevant for SPARQL Results), `1` means repeat the term from the same column in the previous row, etc.

From experimentation we found 16MB to be the optimal maximum dictionary size, as that meant `N` was 4 bytes, so each row would be `4M` bytes where `M` is the number of columns in each row (3 for triples, 4 for quads, arbitrary for SPARQL results). Having a fixed maximum dictionary size had a couple of benefits:

- We could populate multiple blocks of data in parallel
- It limits how much data any client needs to process at once if it stream processes the results. At most a client must hold the 16MB dictionary in memory and then read an additional `4M` bytes for each row.

Since rows are fixed width you can also achieve random access within the compressed file by reading very small portions of the global and block headers to seek to the position of any given tuple in the file.
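To illustrate why fixed-width rows make stream processing and random access cheap, here's a hypothetical sketch of the row handling. The 4-byte column width and the special `0`/`1` offsets are as described above, but the class and method names are made up and this is not the actual BTF code:

```java
import java.nio.ByteBuffer;

// Illustrative only: fixed-width row access for a compressed dictionary block.
// Assumes 4-byte big-endian dictionary offsets per column, with offset 0 meaning
// "no value" and offset 1 meaning "repeat the previous row's value in this column".
final class FixedWidthRows {
    static final int OFFSET_WIDTH = 4;        // N bytes per column
    static final long NO_VALUE = 0;
    static final long REPEAT_PREVIOUS = 1;

    /** Byte position of row {@code rowIndex} within a block's rows region. */
    static long rowPosition(long rowsRegionStart, int columns, long rowIndex) {
        return rowsRegionStart + rowIndex * (long) columns * OFFSET_WIDTH;
    }

    /** Decode one fixed-width row into its dictionary offsets. */
    static long[] decodeRow(byte[] rowBytes, int columns) {
        ByteBuffer buf = ByteBuffer.wrap(rowBytes);
        long[] offsets = new long[columns];
        for (int c = 0; c < columns; c++) {
            offsets[c] = Integer.toUnsignedLong(buf.getInt());
        }
        return offsets;
    }
}
```

Because every row is exactly `4M` bytes, seeking to an arbitrary row is a single multiplication plus one positioned read, with no need to scan the preceding rows.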
#### Advanced Dictionaries

We also experimented with a lot of tricks around building the dictionary, though none of those ended up yielding that much benefit. For example, in an early iteration we used [variable width](https://en.wikipedia.org/wiki/Variable-length_quantity) offset encoding to reference the dictionary, i.e. the rows contain `1` to `N` bytes for each offset. This was combined with a form of [Huffman coding](https://en.wikipedia.org/wiki/Huffman_coding) for the dictionary, so that the most common terms would be given lower offsets in the dictionary and thus have smaller references in the row portion of the block. In principle this meant that the rows portion of the block should be smaller. In practice this proved not to be the case, because with a variable width encoding you end up using more bytes overall: the terms at the end of the dictionary end up needing 5 bytes to be encoded (assuming a 16MB dictionary) due to how variable width encodings work. (There's a rough sketch of such an encoding at the end of this comment.)

Also, having variable width rows adds some complexity to the row reading logic, the cost of which can end up outweighing the benefit of a sometimes marginally smaller file size. Plus, if you want to allow random access to an arbitrary row within the result set, you now have to manually read through much of the file to do this because you don't know how many bytes to skip in advance. We ended up throwing out all usage of variable width encodings as a result.

### Performance Comparisons

As noted [earlier](#quantifying-overall-performance) we did extensive comparisons of BTF against all the built-in Jena formats (for RDF graphs, datasets and SPARQL results) and some other academic research like HDT. The end result was that BTF was about the same size as the equivalent Gzipped NTriples/SPARQL TSV **BUT** was slower to both read and write, so overall staying with the standard formats proved to be better.

I would also note that over the lifetime of this research project, if anything, the gap between the Jena built-ins and BTF actually increased, as optimisations were introduced into the Jena parsers and writers and the JVM optimiser got better.
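As a footnote to the variable width encoding discussion above, here's a generic VLQ sketch (7 bits per byte with a continuation bit, i.e. the standard scheme from the Wikipedia link, not necessarily the exact encoding we used) showing how the per-offset saving shrinks for terms near the end of a large dictionary:

```java
import java.io.ByteArrayOutputStream;

// Generic VLQ encoder: low 7 bits per byte, high bit set when more bytes follow.
// Illustrative only - not the BTF implementation.
final class Vlq {
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        do {
            int b = (int) (value & 0x7F);
            value >>>= 7;
            if (value != 0) b |= 0x80;   // continuation bit: more bytes follow
            out.write(b);
        } while (value != 0);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Small (frequent, Huffman-ordered) offsets encode compactly...
        System.out.println(Vlq.encode(100).length);          // 1 byte
        // ...but offsets near the end of a large dictionary need 4 or more bytes,
        // eroding the saving versus a fixed 4-byte column width.
        System.out.println(Vlq.encode(16_000_000L).length);  // 4 bytes
    }
}
```

Small offsets do shrink to a single byte, but most references into a multi-megabyte dictionary need 4 or more bytes, which is why the fixed 4-byte rows ended up both simpler and no bigger in practice.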
