rvesse commented on issue #2240: URL: https://github.com/apache/jena/issues/2240#issuecomment-1926709732
So I looked at this problem extensively around 10 years ago and developed something we called Binary Tuple Format (BTF). This was all proprietary $dayjob work and it never went beyond the research stage, for several reasons which I'll go into in depth in this response. So there's no code/reports I can share, but I can talk about it in general terms.

TL;DR Version:

- Using GZip compression with formats like NTriples/RDF-Thrift for graphs, and SPARQL-TSV for results, generally offered the best overall performance (see discussion for what that means)
- Specialised formats can offer some performance gains on specific metrics, e.g. data size, but this tends to not be that relevant when you look at overall performance

----

### Binary Tuple Format

Binary Tuple Format (BTF) was an effort to address the problem of how to communicate results faster between the server and the client. This was work at Cray, so we had some specific additional constraints that we were looking to take account of:

1. The server had vastly more memory than the client(s)
2. The server could parallelise the results output because we didn't send the results directly over the network. Rather the server wrote a file to the parallel file system and merely communicated the metadata (the filename, results count and size etc.) over the wire to the client.
3. The client was often resource constrained (particularly for memory) relative to the server, so any format had to allow for that

Design-wise the format consisted of a global header (magic bytes, metadata, data type - triples, quads or results - result column names if applicable, prefix mappings etc.), followed by one or more data blocks. Each data block consists of a block header (block type, block type specific header fields) followed by the actual data. This gave us a framework in which we could experiment extensively with different compression schemes and data layouts. Since an output file consisted of multiple blocks we could generate these in parallel and then write the blocks out in any sequence we wanted (except for ordered SPARQL results, where order needed preserving).

We implemented a number of different block types, from what we called Tabular Blocks (basically uncompressed NTriples/SPARQL TSV) through to [Compressed Dictionary Blocks](#compressed-dictionary-blocks) that used different encoding and compression techniques.

### Quantifying Overall Performance

How you quantify performance becomes a key part of the problem. It's easy to focus on a single metric, e.g. overall data size, but we found through extensive comparisons between BTF, all the standard formats supported by Jena, and [HDT](https://www.rdfhdt.org) that it wasn't the whole story.

Firstly, data is never just written, it always needs to be read at some point as well. So at a bare minimum you need to consider the whole round trip, i.e. both serialisation and parsing time.

Secondly, there is also a computational cost (CPU and memory) associated with each data format. Applying any kind of compression is trading CPU time plus memory consumption for output size. Depending on the system the overhead may actually outweigh the benefit.

Aside - not specific to BTF, but with other well known distributed compute frameworks we were running on our systems at the time, it was actually a benefit to disable their default network compression behaviour. Our network interconnect was so good it wasn't worth trading the CPU time for smaller data transfers; it was faster just to send the uncompressed data.
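To make the round trip point above concrete, here's a minimal sketch (not our benchmark harness, and `data.nt` is just a placeholder input) that times both halves of a GZipped NTriples round trip using Jena's standard `RDFDataMgr` plus `java.util.zip`; a real comparison would also need to track memory and warm up the JVM over repeated runs:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

public class RoundTripTiming {
    public static void main(String[] args) throws Exception {
        Model model = ModelFactory.createDefaultModel();
        RDFDataMgr.read(model, "data.nt");   // placeholder input file

        // Serialise: N-Triples wrapped in GZip
        long start = System.nanoTime();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            RDFDataMgr.write(gzip, model, Lang.NTRIPLES);
        }
        long writeNanos = System.nanoTime() - start;

        // Parse it back: the other half of the round trip
        start = System.nanoTime();
        Model parsed = ModelFactory.createDefaultModel();
        try (GZIPInputStream gzip = new GZIPInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            RDFDataMgr.read(parsed, gzip, Lang.NTRIPLES);
        }
        long readNanos = System.nanoTime() - start;

        System.out.printf("size=%d bytes, write=%d ms, read=%d ms%n",
                bytes.size(), writeNanos / 1_000_000, readNanos / 1_000_000);
    }
}
```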
And as @afs notes, networks have evolved hugely over the intervening years. That's not to say there aren't cases where the network is the bottleneck, even in cloud/HPC environments, but it isn't the only thing to consider these days.

Which is to say that any judgement of improved performance has to be holistic. It's not enough to just say that your new format consistently gives smaller file sizes if it does so at the cost of increased processing time and/or memory consumption. You **MUST** evaluate all those factors together somehow, e.g. weighted scoring, and I'm not sure we ever came up with a good metric for that.

### Standards Matter

The RDF/Semantic Web community already has a problem IMO with over-proliferation of standards. For any new format to be broadly useful it has to be adopted across the wider community. If it's only in Jena it only benefits users whose entire workflow/lifecycle can be encompassed in Jena. As soon as they go outside of Jena they lose the benefits. This might be fine for your specific use case, but if it doesn't benefit the majority of users are we actually adding sufficient value to justify it?

In most real-world use cases I've seen, Jena is rarely involved in 100% of the workflow. Typically Jena's involvement terminates at Fuseki as a SPARQL endpoint, and then other parts of the business use JavaScript, Python etc. to implement UIs, data science etc. on top of the SPARQL results.

### Compressed Dictionary Blocks

Going back to BTF, the data block type that ended up giving the best overall performance (caveats on that term as I described earlier) was what we called a Compressed Dictionary Block. This block type consisted of a fixed size dictionary that was compressed with Gzip, followed by a series of rows. Each row is a fixed number of bytes, with `N` bytes for each column, where the value of `N` depends on the size of the dictionary in use. The value for each column is the offset of the associated RDF term in the block's dictionary. We assigned some offsets special meanings, e.g. `0` means no value (particularly relevant for SPARQL Results), `1` means repeat the term from the same column in the previous row, etc.

From experimentation we found 16MB to be the optimal maximum dictionary size, as that meant `N` was 4 bytes, so each row would be `4M` bytes where `M` is the number of columns in each row (3 for triples, 4 for quads, arbitrary for SPARQL results). Having a fixed maximum dictionary size had a couple of benefits:

- We could populate multiple blocks of data in parallel
- It limits how much data any client needs to process at once if it stream processes the results. At most a client must hold the 16MB dictionary in memory and then read an additional `4M` bytes for each row.

Since rows are fixed width you can also achieve random access within the compressed file by reading very small portions of the global and block headers to seek to the position of any given tuple in the file.
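To illustrate why fixed-width rows make stream processing and random access cheap, here's a hypothetical sketch of the row handling. The 4-byte column width and the special `0`/`1` offsets are as described above, but the class and method names are made up and this is not the actual BTF code:

```java
import java.nio.ByteBuffer;

// Illustrative only: fixed-width row access for a compressed dictionary block.
// Assumes 4-byte big-endian dictionary offsets per column, with offset 0 meaning
// "no value" and offset 1 meaning "repeat the previous row's value in this column".
final class FixedWidthRows {
    static final int OFFSET_WIDTH = 4;        // N bytes per column
    static final long NO_VALUE = 0;
    static final long REPEAT_PREVIOUS = 1;

    /** Byte position of row {@code rowIndex} within a block's rows region. */
    static long rowPosition(long rowsRegionStart, int columns, long rowIndex) {
        return rowsRegionStart + rowIndex * (long) columns * OFFSET_WIDTH;
    }

    /** Decode one fixed-width row into its dictionary offsets. */
    static long[] decodeRow(byte[] rowBytes, int columns) {
        ByteBuffer buf = ByteBuffer.wrap(rowBytes);
        long[] offsets = new long[columns];
        for (int c = 0; c < columns; c++) {
            offsets[c] = Integer.toUnsignedLong(buf.getInt());
        }
        return offsets;
    }
}
```

Because every row is exactly `4M` bytes, seeking to an arbitrary row is a single multiplication plus one positioned read, with no need to scan the preceding rows.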
#### Advanced Dictionaries

We also experimented with a lot of tricks around building the dictionary, though none of those ended up yielding that much benefit. For example, in an early iteration we used [variable width](https://en.wikipedia.org/wiki/Variable-length_quantity) offset encoding to reference the dictionary, i.e. the rows contain `1` to `N` bytes for each offset. This was combined with a form of [Huffman coding](https://en.wikipedia.org/wiki/Huffman_coding) for the dictionary, so that the most common terms would be given lower offsets in the dictionary and thus have smaller references in the row portion of the block. In principle this meant that the rows portion of the block should be smaller. In practice this proved not to be the case, because with a variable width encoding you end up using more bytes overall: the terms at the end of the dictionary end up needing 5 bytes to be encoded (assuming a 16MB dictionary) due to how variable width encodings work. (There's a rough sketch of such an encoding at the end of this comment.)

Also, having variable width rows adds some complexity to the row reading logic, the cost of which can end up outweighing the benefit of a sometimes marginally smaller file size. Plus, if you want to allow random access to an arbitrary row within the result set, you now have to manually read through much of the file to do this because you don't know how many bytes to skip in advance. We ended up throwing out all usage of variable width encodings as a result.

### Performance Comparisons

As noted [earlier](#quantifying-overall-performance) we did extensive comparisons of BTF against all the built-in Jena formats (for RDF graphs, datasets and SPARQL results) and some other academic research like HDT. The end result was that BTF was about the same size as the equivalent Gzipped NTriples/SPARQL TSV **BUT** was slower to both read and write, so overall staying with the standard formats proved to be better.

I would also note that over the lifetime of this research project, if anything, the gap between the Jena built-ins and BTF actually increased, as optimisations were introduced into the Jena parsers and writers and the JVM optimiser got better.
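As a footnote to the variable width encoding discussion above, here's a generic VLQ sketch (7 bits per byte with a continuation bit, i.e. the standard scheme from the Wikipedia link, not necessarily the exact encoding we used) showing how the per-offset saving shrinks for terms near the end of a large dictionary:

```java
import java.io.ByteArrayOutputStream;

// Generic VLQ encoder: low 7 bits per byte, high bit set when more bytes follow.
// Illustrative only - not the BTF implementation.
final class Vlq {
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        do {
            int b = (int) (value & 0x7F);
            value >>>= 7;
            if (value != 0) b |= 0x80;   // continuation bit: more bytes follow
            out.write(b);
        } while (value != 0);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Small (frequent, Huffman-ordered) offsets encode compactly...
        System.out.println(Vlq.encode(100).length);          // 1 byte
        // ...but offsets near the end of a large dictionary need 4 or more bytes,
        // eroding the saving versus a fixed 4-byte column width.
        System.out.println(Vlq.encode(16_000_000L).length);  // 4 bytes
    }
}
```

Small offsets do shrink to a single byte, but most references into a multi-megabyte dictionary need 4 or more bytes, which is why the fixed 4-byte rows ended up both simpler and no bigger in practice.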
