arne-bdt commented on issue #2240: URL: https://github.com/apache/jena/issues/2240#issuecomment-1939717414
Thank you very much for all your feedback.

@rvesse, regarding your attack scenario in which the server could run out of memory because all strings must be retained until the last result row is sent: that is indeed a valid concern. This vulnerability might make the format less suitable for public SPARQL servers. Within our cluster, however, for service-to-service and service-to-UI client communication, it could serve as an effective bandwidth-saving alternative. Regarding your concerns about the in-memory measurement: since bandwidth and I/O performance vary greatly across scenarios, I thought it prudent to exclude these factors from the performance measurements. Are you suggesting that different disk-flushing strategies might not have been adequately considered?

@afs, it appears that the Protobuf and Thrift formats do not use ParserProfileStd, which uses CacheCaffeine with a default cache size of 500. Instead, they use CacheSimple with a default cache size of 5000. The rationale behind the two different LRU cache implementations and the differing default cache sizes is unclear to me. Neither Protobuf nor Thrift seems to use FactoryRDFCaching, and caching is applied only to graphs, not to results.

--> I could try to use the IRI cache for the results too and measure the impact. Would that be helpful?
--> I could try to use FactoryRDFCaching#createTypedLiteral for graph and result deserialization and measure the impact. Would that be helpful?

The idea of a JVM-wide term cache is intriguing, but I am uncertain whether it might slow the system down under heavy multithreaded workloads. An alternative could be making it optionally thread-local. Another open question is the optimal size for such a cache.

### Back to my current approach

Introducing another data format is not my intention.
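To make the term-cache discussion concrete, here is a minimal sketch of an LRU cache built on plain JDK classes. This is only an illustration of the deduplication idea behind CacheSimple/CacheCaffeine, not Jena code; the class and method names are invented for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Minimal LRU cache illustrating term-level deduplication during
 * deserialization. A sketch with plain JDK classes only — not the
 * actual Jena CacheSimple/CacheCaffeine implementations.
 */
final class LruTermCache<K, V> {
    private final Map<K, V> map;

    LruTermCache(int maxSize) {
        // accessOrder=true makes LinkedHashMap track access order,
        // so the eldest entry is the least recently used one.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize;    // evict once capacity is exceeded
            }
        };
    }

    /** Return the cached value, computing and caching it on a miss. */
    V getOrCompute(K key, Function<K, V> factory) {
        return map.computeIfAbsent(key, factory);
    }

    int size() { return map.size(); }
}
```

A thread-local variant (one such cache per worker thread) would avoid contention under heavy multithreaded load, at the cost of duplicating entries across threads — which is exactly the trade-off in question.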
To identify the optimal format for our needs, we conducted extensive benchmarks across various serialization formats, combined with different compression algorithms and levels. We also considered the frequency of graph writes versus reads and the implications for disk space over time. For our use case, RDF_THRIFT_VALUES (soon to be RDF_THRIFT) combined with LZ4_fastest emerged as the best choice. Our active datasets are maintained as in-memory graphs to ensure peak performance, with only changes being written as deltas to the RDBMS, following a CQRS-with-event-sourcing pattern.

A significant bottleneck we've identified is the substantial data traffic generated by services executing many parallel queries. Upon examining the THRIFT results format, I noticed that all data fields contain strings, with many repetitions. In my initial implementation, I simply adapted the THRIFT format and code, substituting all strings with i32 indices and adding lists of new strings to the rows. The modifications to the code were minimal.

For deserialization in Java, the fastest approach seems to be a single ArrayList<String>: new strings are appended with #addAll, and the corresponding strings are retrieved with #get(index). The downside is that the consumer must retain the growing string list in memory until the stream is fully processed. The upside is its simplicity, and an equivalent of ArrayList<String> is likely available in other programming languages.
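The decoding scheme described above can be sketched as follows. Each row carries only the strings that are new in that row; the consumer appends them with `addAll` and resolves the i32 column indices with `get(index)`. The `Row` type and method names here are invented for illustration — they are not the actual THRIFT2 wire types.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of index-based result decoding: rows carry only the strings
 * that are new, plus one i32 index per column pointing into the
 * growing shared string table. Illustration only, not the real format.
 */
final class StringTableDecoder {
    /** newStrings: strings first seen in this row; indexes: one per column. */
    record Row(List<String> newStrings, int[] indexes) {}

    // Grows monotonically until the stream ends — the memory trade-off
    // discussed above.
    private final ArrayList<String> table = new ArrayList<>();

    /** Resolve one row's columns back to strings. */
    List<String> decode(Row row) {
        table.addAll(row.newStrings());     // extend the shared table
        List<String> columns = new ArrayList<>(row.indexes().length);
        for (int idx : row.indexes())
            columns.add(table.get(idx));    // O(1) lookup per column
        return columns;
    }
}
```

A repeated term then costs only a small integer index on the wire instead of the full string, which is where the size savings for repetitive SPARQL results come from.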
THRIFT2 is a variation of THRIFT. I've conducted JMH benchmarks on several graphs that I had access to:

- pizza.owl.rdf and cheese.ttl (I sometimes left these out of some benchmarks because of their relatively small size)
- RealGrid_EQ.xml, RealGrid_SSH.xml, RealGrid_TP.xml, RealGrid_SV.xml (sample datasets from the CGMES conformity assessment, available for download from [ENTSO-E](https://www.entsoe.eu/Documents/CIM_documents/Grid_Model_CIM/TestConfigurations_packageCASv2.0.zip))
- xxx_CGMES_EQ.xml, xxx_CGMES_SSH.xml, xxx_CGMES_TP.xml (real datasets from one of our projects; not publicly available)
- bsbm-1m.nt.gz

Each of these datasets, especially the EQ, SSH, TP, and SV graphs, has a unique structure, leading to varying distributions of subjects, predicates, and objects. Since additional compression is often relevant, I measured uncompressed, GZIP-compressed, and LZ4Fastest-compressed. For the SPARQL query results, I opted for a straightforward "SELECT ?s ?p ?o FROM ..." query; based on my observations, the choice of query would not significantly alter the outcomes. Here I compared THRIFT with my implementation (currently named THRIFT2):

**Speed of a roundtrip**

The simplest way to measure speed is a roundtrip of serialization and deserialization. If the producer has significantly more or fewer resources than the consumer, this perspective is of course too simplified; the same holds if data is written more often than it is read, or vice versa.

--> For SPARQL results, THRIFT2 outperforms THRIFT in all cases and with all tested compressions.
--> For graphs, THRIFT2 outperforms THRIFT in combination with compression, but not uncompressed.

**Size**

For size, I simply calculated the number of triples per MB:

--> Uncompressed THRIFT2 SPARQL results are in all cases at least 50% smaller than uncompressed THRIFT.
--> Uncompressed THRIFT2 graphs are in all cases at least 50% smaller than uncompressed THRIFT.

**Speed of serialization**

--> Overall, THRIFT2 is faster than THRIFT during serialization.

**Speed of deserialization**

--> In combination with compression, THRIFT2 outperforms THRIFT for SPARQL results. Uncompressed, there is no clear winner.
--> For graphs, THRIFT2 outperforms THRIFT in combination with compression, but not uncompressed.

**Questions**

What do you think? Are the advantages shown in the benchmarks too small to justify an additional format? Should I move forward with the THRIFT variant, or perhaps only with the results format and not the graphs format? Would it be appropriate to move THRIFT2 (or whatever name it ends up with) to a separate package, to avoid introducing yet another format?

### Benchmarks for SPARQL-result serialization/deserialization

Since I've conducted some benchmarks anyway, it might be interesting to compare all available formats here. Note: I tried the same approach with PROTO, but Protobuf itself seems to behave differently when deserializing string lists. Deserialization in PROTO2 is, in my opinion, too slow to be considered an option.

**Size**

**Speed of a roundtrip**

**Speed of serialization**

**Speed of deserialization**

### Benchmarks for graph serialization/deserialization

**Size**

**Speed of a roundtrip**

**Speed of serialization**

**Speed of deserialization**

My Excel file with all the numbers: [Apache Jena Serialization.xlsx](https://github.com/apache/jena/files/14251580/Apache.Jena.Serialization.xlsx)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
