[GitHub] [jena] afs commented on issue #1296: StreamRDFWriter getWriterStream()

GitBox Sun, 08 May 2022 05:44:47 -0700


afs commented on issue #1296:
URL: https://github.com/apache/jena/issues/1296#issuecomment-1120412294


   RIOT has its own tokenizer and parsers; the combination is 2x to 4x faster.
The tokenizer is the performance bottleneck.
   
   The fastest parsers in Jena run at up to 1 million triples/second on binary
RDF Thrift. RDF Protobuf is slightly less than 10% slower (making protobuf work
for open-ended streams of input seems to create an extra object, and at
1 microsecond per triple this is observable).
   
   The performance of Turtle and N-Triples is approximately 240 kTPS and
400 kTPS respectively (kTPS = thousand triples per second). The only difference
is that the N-Triples grammar parser is much simpler than all the "if"s needed
for Turtle.
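
   To put the throughput figures above in per-triple terms, here is a small
sketch that converts triples/second into a time budget per triple. The class
and method names are made up for illustration and are not part of Jena; the
input numbers are the approximate figures quoted in this comment.

```java
public class TripleBudget {
    /** Nanoseconds available per triple at a given throughput (triples/second). */
    public static double nanosPerTriple(double triplesPerSecond) {
        return 1_000_000_000.0 / triplesPerSecond;
    }

    public static void main(String[] args) {
        // Approximate figures quoted in the comment, not fresh measurements.
        System.out.printf("RDF Thrift (1,000,000 TPS): %.0f ns/triple%n", nanosPerTriple(1_000_000));
        System.out.printf("N-Triples  (400,000 TPS):   %.0f ns/triple%n", nanosPerTriple(400_000));
        System.out.printf("Turtle     (240,000 TPS):   %.0f ns/triple%n", nanosPerTriple(240_000));
    }
}
```

   At 1 million triples/second the whole budget is about 1000 ns per triple,
which is why one extra object allocation per triple can show up in the numbers.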
   
   All these are at least 4x faster than JavaCC.
   
   All parsing performance is sensitive to the hardware used, so these figures
are relative. (They are from an old Core i5 with a SATA SSD, which has been used
consistently for measurements over time.)
   
   Java has to convert the input bytes to Java chars at some point, which is a
copy. In fact, it is faster to convert large buffers using Java's built-in UTF-8
handling than to try to do one less copy by converting each RDF term
separately. Java checks all input for validity of UTF-8.
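
   As an illustration of the bulk-conversion point, the sketch below decodes a
whole byte buffer in one call with the JDK's `CharsetDecoder`. It is not RIOT's
actual code; the class name `BulkUtf8Decode` and the method `decodeBulk` are
hypothetical, and only standard `java.nio` calls are used.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class BulkUtf8Decode {
    /**
     * Decode an entire buffer of UTF-8 bytes in a single call.
     * Malformed input is reported as an exception rather than replaced,
     * so validity checking happens as part of the same pass.
     */
    public static CharBuffer decodeBulk(byte[] input) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(input));
    }

    public static void main(String[] args) throws CharacterCodingException {
        // One N-Triples line with a non-ASCII character, as sample input.
        byte[] data = "<http://example/s> <http://example/p> \"caf\u00e9\" .\n"
                .getBytes(StandardCharsets.UTF_8);
        CharBuffer chars = decodeBulk(data);
        System.out.println(chars.length() + " chars decoded from " + data.length + " bytes");
    }
}
```

   A tokenizer would then scan the resulting `CharBuffer` directly, instead of
building a `String` from a byte slice for every term.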
   
   If you'd like to improve the tokenizer and provide a PR, that would be great.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


