afs commented on issue #1296: URL: https://github.com/apache/jena/issues/1296#issuecomment-1120412294
RIOT has its own tokenizer and parsers - the combination is x2 to x4 faster. The tokenizer is the performance bottleneck.

The fastest parsers in Jena run at up to 1 million triples/second on binary RDF Thrift. RDF Protobuf is slightly less than 10% slower (making protobuf work for open-ended streams of input seems to create an extra object, and at 1 microsecond a triple this is observable).

The performance of Turtle and N-Triples is approximately 240 kTPS and 400 kTPS respectively. The only difference is that the N-Triples grammar parser is much simpler than all the "if"s needed for Turtle. All of these are a minimum of x4 faster than JavaCC.

All parsing performance is sensitive to the hardware used, so these figures are relative. (They are from an old Core i5 with a SATA SSD, which has been used consistently for measurements over time.)

Java has to convert to Java chars at some point, which is a copy. In fact, it is faster to convert large buffers using Java's built-in UTF-8 handling than to try to do one less copy but convert each RDF term separately. Java checks all input for UTF-8 validity.

If you'd like to improve the tokenizer and provide a PR, that would be great.
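To make the two UTF-8 strategies concrete, here is a minimal sketch using only the JDK. It is not Jena's tokenizer: the class `Utf8DecodeSketch`, both method names, and the split-on-space "tokenizing" are hypothetical and only there for illustration. It shows bulk decoding of a whole buffer with the built-in decoder (which rejects invalid UTF-8) next to decoding each term's bytes separately. It makes no claim about timings, which would need to be measured on real data.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not how RIOT is implemented.
public class Utf8DecodeSketch {

    // Strategy A: decode the whole buffer once, then split on chars.
    // The decoder is configured to report (throw on) malformed UTF-8.
    static List<String> decodeBulk(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer chars;
        try {
            chars = decoder.decode(ByteBuffer.wrap(input));
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
        List<String> terms = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= chars.length(); i++) {
            if (i == chars.length() || chars.charAt(i) == ' ') {
                if (i > start) {
                    terms.add(chars.subSequence(start, i).toString());
                }
                start = i + 1;
            }
        }
        return terms;
    }

    // Strategy B: find term boundaries on the raw bytes and decode each
    // term separately. Byte 0x20 never occurs inside a multi-byte UTF-8
    // sequence, so splitting on it is safe. Note that new String(...)
    // replaces malformed input with U+FFFD instead of throwing.
    static List<String> decodePerTerm(byte[] input) {
        List<String> terms = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= input.length; i++) {
            if (i == input.length || input[i] == ' ') {
                if (i > start) {
                    terms.add(new String(input, start, i - start, StandardCharsets.UTF_8));
                }
                start = i + 1;
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        byte[] line = "<http://example/s> <http://example/p> \"caf\u00e9\" ."
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeBulk(line));
        System.out.println(decodePerTerm(line));
    }
}
```

Both methods return the same terms for valid input; they differ in how many decode calls are made and in how invalid bytes are handled.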
