arne-bdt commented on issue #2240: URL: https://github.com/apache/jena/issues/2240#issuecomment-1939717414
Thank you very much for all your feedback.

@rvesse, regarding your attack scenario in which the server could run out of memory because all strings must be retained until the last result row is sent: that is indeed a valid concern. This vulnerability might make the format less suitable for public SPARQL servers. Within our cluster, however, for service-to-service and service-to-UI client communication, it could serve as an effective bandwidth-saving alternative. Regarding your concerns about the in-memory measurement: since bandwidth and I/O performance vary greatly across scenarios, I thought it prudent to exclude these factors from the performance measurements. Are you suggesting that different disk-flushing strategies might not have been adequately considered?

@afs, it appears that the Protobuf and Thrift formats do not use ParserProfileStd, which uses CacheCaffeine with a default cache size of 500. Instead, they use CacheSimple with a default cache size of 5000. The rationale behind the two different LRU cache implementations and the differing default cache sizes is unclear to me. Neither Protobuf nor Thrift seems to use FactoryRDFCaching, and caching is applied only to graphs, not to results.

--> I could try to use the IRI cache for the results too and measure the impact. Would that be helpful?
--> I could try to use FactoryRDFCaching#createTypedLiteral for graph and result deserialization and measure the impact. Would that be helpful?

The idea of a JVM-wide term cache is intriguing, but I am uncertain whether it might slow the system down under heavy multithreaded workloads. An alternative could be making it optionally thread-local. Another open question is the optimal size for such a cache.

### Back to my current approach

Introducing another data format is not my intention.
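To make the term-cache discussion concrete, here is a minimal sketch of an LRU cache built on plain JDK classes. This is only an illustration of the deduplication idea behind CacheSimple/CacheCaffeine, not Jena code; the class and method names are invented for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Minimal LRU cache illustrating term-level deduplication during
 * deserialization. A sketch with plain JDK classes only — not the
 * actual Jena CacheSimple/CacheCaffeine implementations.
 */
final class LruTermCache<K, V> {
    private final Map<K, V> map;

    LruTermCache(int maxSize) {
        // accessOrder=true makes LinkedHashMap track access order,
        // so the eldest entry is the least recently used one.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize;    // evict once capacity is exceeded
            }
        };
    }

    /** Return the cached value, computing and caching it on a miss. */
    V getOrCompute(K key, Function<K, V> factory) {
        return map.computeIfAbsent(key, factory);
    }

    int size() { return map.size(); }
}
```

A thread-local variant (one such cache per worker thread) would avoid contention under heavy multithreaded load, at the cost of duplicating entries across threads — which is exactly the trade-off in question.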
To identify the optimal format for our needs, we conducted extensive benchmarks across various serialization formats, combined with different compression algorithms and levels. We also considered the frequency of graph writes versus reads and the implications for disk space over time. For our use case, RDF_THRIFT_VALUES (soon to be RDF_THRIFT) combined with LZ4_fastest emerged as the best choice. Our active datasets are maintained as in-memory graphs to ensure peak performance, with only changes being written as deltas to the RDBMS, following a CQRS-with-event-sourcing pattern.

A significant bottleneck we've identified is the substantial data traffic generated by services executing many parallel queries. Upon examining the THRIFT results format, I noticed that all data fields contain strings, with many repetitions. In my initial implementation, I simply adapted the THRIFT format and code, substituting all strings with i32 indices and adding lists of new strings to the rows. The modifications to the code were minimal.

For deserialization in Java, the fastest approach seems to be a single ArrayList<String>: new strings are appended with #addAll, and the corresponding strings are retrieved with #get(index). The downside is that the consumer must retain the growing string list in memory until the stream is fully processed. The upside is its simplicity, and an equivalent of ArrayList<String> is likely available in other programming languages.
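The decoding scheme described above can be sketched as follows. Each row carries only the strings that are new in that row; the consumer appends them with `addAll` and resolves the i32 column indices with `get(index)`. The `Row` type and method names here are invented for illustration — they are not the actual THRIFT2 wire types.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of index-based result decoding: rows carry only the strings
 * that are new, plus one i32 index per column pointing into the
 * growing shared string table. Illustration only, not the real format.
 */
final class StringTableDecoder {
    /** newStrings: strings first seen in this row; indexes: one per column. */
    record Row(List<String> newStrings, int[] indexes) {}

    // Grows monotonically until the stream ends — the memory trade-off
    // discussed above.
    private final ArrayList<String> table = new ArrayList<>();

    /** Resolve one row's columns back to strings. */
    List<String> decode(Row row) {
        table.addAll(row.newStrings());     // extend the shared table
        List<String> columns = new ArrayList<>(row.indexes().length);
        for (int idx : row.indexes())
            columns.add(table.get(idx));    // O(1) lookup per column
        return columns;
    }
}
```

A repeated term then costs only a small integer index on the wire instead of the full string, which is where the size savings for repetitive SPARQL results come from.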
THRIFT2 is a variation of THRIFT. I've conducted JMH benchmarks on several graphs that I had access to:

- pizza.owl.rdf and cheese.ttl (I sometimes left these out of some benchmarks because of their relatively small size)
- RealGrid_EQ.xml, RealGrid_SSH.xml, RealGrid_TP.xml, RealGrid_SV.xml (sample datasets from the CGMES conformity assessment, available for download from [ENTSO-E](https://www.entsoe.eu/Documents/CIM_documents/Grid_Model_CIM/TestConfigurations_packageCASv2.0.zip))
- xxx_CGMES_EQ.xml, xxx_CGMES_SSH.xml, xxx_CGMES_TP.xml (real datasets from one of our projects; not publicly available)
- bsbm-1m.nt.gz

Each of these datasets, especially the EQ, SSH, TP, and SV graphs, has a unique structure, leading to varying distributions of subjects, predicates, and objects. Since additional compression is often relevant, I measured uncompressed, GZIP-compressed, and LZ4Fastest-compressed. For the SPARQL query results, I opted for a straightforward "SELECT ?s ?p ?o FROM ..." query; based on my observations, the choice of query would not significantly alter the outcomes. Here I compared THRIFT with my implementation (currently named THRIFT2):

**Speed of a roundtrip**

The simplest way to measure speed is a roundtrip of serialization and deserialization. If the producer has significantly more or fewer resources than the consumer, this perspective is of course too simplified; the same holds if data is written more often than it is read, or vice versa.

--> For SPARQL results, THRIFT2 outperforms THRIFT in all cases and with all tested compressions.
--> For graphs, THRIFT2 outperforms THRIFT in combination with compression, but not uncompressed.

**Size**

For size, I simply calculated the number of triples per MB:

--> Uncompressed THRIFT2 SPARQL results are in all cases at least 50% smaller than uncompressed THRIFT.
--> Uncompressed THRIFT2 graphs are in all cases at least 50% smaller than uncompressed THRIFT.

**Speed of serialization**

--> Overall, THRIFT2 is faster than THRIFT during serialization.

**Speed of deserialization**

--> In combination with compression, THRIFT2 outperforms THRIFT for SPARQL results. Uncompressed, there is no clear winner.
--> For graphs, THRIFT2 outperforms THRIFT in combination with compression, but not uncompressed.

**Questions**

What do you think? Are the advantages shown in the benchmarks too small to justify an additional format? Should I move forward with the THRIFT variant, or perhaps only with the results format and not the graphs format? Would it be appropriate to move THRIFT2 (or whatever name it ends up with) to a separate package, to avoid introducing yet another format?

### Benchmarks for SPARQL-result serialization/deserialization

Since I've conducted some benchmarks anyway, it might be interesting to compare all available formats here. Note: I tried the same approach with PROTO, but Protobuf itself seems to behave differently when deserializing string lists. Deserialization in PROTO2 is, in my opinion, too slow to be considered an option.

**Size**

**Speed of a roundtrip**

**Speed of serialization**

**Speed of deserialization**

### Benchmarks for graph serialization/deserialization

**Size**

**Speed of a roundtrip**

**Speed of serialization**

**Speed of deserialization**

My Excel file with all the numbers: [Apache Jena Serialization.xlsx](https://github.com/apache/jena/files/14251580/Apache.Jena.Serialization.xlsx)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
