gianm opened a new pull request, #15693: URL: https://github.com/apache/druid/pull/15693
These readers were running a UTF-8 decode on the provided entity to convert it to a String, then parsing the String as JSON. The patch changes them to parse the provided entity's input stream directly.

To preserve the nice error messages that include the unparseable input, the readers now need to open the entity again on the error path and re-read the data. To make this possible, the `InputEntity#open` contract is tightened to require the ability to re-open entities, and existing `InputEntity` implementations are updated to allow re-opening.

This patch also renames `JsonLineReaderBenchmark` to `JsonInputFormatBenchmark`, updates it to benchmark all three JSON readers, and adds a case that reads fields out of the parsed row (rather than only creating it).

Benchmarks below. Findings:

- The `reader` and `node_reader` readers are ~15% faster on `parseAndRead` (about 16% and 18% lower average time per op, respectively). `reader` is the default for stream ingest; `node_reader` is used for stream ingest if `useJsonNodeReader` is set.
- The `line_reader` wasn't changed in this patch and its performance is the same (within margin of error). It is the default for batch ingest, and is used for streaming if `assumeNewlineDelimited` is set.

So the speedups are mainly for streaming ingest. But #15681 has a similar speedup for `line_reader` if that's the one you care about!

```
master

Benchmark                              (readerTypeString)  Mode  Cnt     Score     Error  Units
JsonInputFormatBenchmark.parseAndRead              reader  avgt    5  3148.287 ± 117.748  us/op
JsonInputFormatBenchmark.parseAndRead         node_reader  avgt    5  3232.287 ±  20.667  us/op
JsonInputFormatBenchmark.parseAndRead         line_reader  avgt    5  3085.638 ±  45.131  us/op

patch

Benchmark                              (readerTypeString)  Mode  Cnt     Score     Error  Units
JsonInputFormatBenchmark.parseAndRead              reader  avgt    5  2656.737 ±  65.348  us/op
JsonInputFormatBenchmark.parseAndRead         node_reader  avgt    5  2659.078 ±  53.231  us/op
JsonInputFormatBenchmark.parseAndRead         line_reader  avgt    5  3017.010 ±  63.724  us/op
```
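
For illustration only, here is a minimal sketch of the "parse the stream directly, re-open on the error path" pattern described in this PR. It is not the code from the patch. `ReopenableEntity`, `StreamParser`, `ParseFailure`, and `parseDirectly` are hypothetical names invented for this sketch, and a trivial digit parser stands in for the JSON parser so the example needs only the JDK (Java 9+ for `readAllBytes`).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReopenOnErrorSketch {
  /** Hypothetical stand-in for an entity whose open() may be called more than once. */
  interface ReopenableEntity {
    InputStream open() throws IOException;
  }

  /** Hypothetical stand-in for a parser that consumes bytes straight from a stream. */
  interface StreamParser<T> {
    T parse(InputStream in) throws IOException;
  }

  /** Hypothetical exception type carrying the raw input in its message. */
  static class ParseFailure extends RuntimeException {
    ParseFailure(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Happy path: one pass over the stream, with no intermediate String. */
  static <T> T parseDirectly(ReopenableEntity entity, StreamParser<T> parser) {
    try (InputStream in = entity.open()) {
      return parser.parse(in);
    } catch (IOException | RuntimeException e) {
      // Error path only: open the entity a second time to recover the raw text.
      throw new ParseFailure("Unable to parse row [" + readRaw(entity) + "]", e);
    }
  }

  /** Re-reads the whole entity as UTF-8 text; used only to build the error message. */
  static String readRaw(ReopenableEntity entity) {
    try (InputStream in = entity.open()) {
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      return "<unavailable: " + e.getMessage() + ">";
    }
  }

  /** Toy parser (assumption: stands in for a JSON parser): ASCII digits to a long. */
  static Long parseDigits(InputStream in) throws IOException {
    long value = 0;
    int b;
    while ((b = in.read()) != -1) {
      if (b < '0' || b > '9') {
        throw new IOException("Unexpected byte: " + (char) b);
      }
      value = value * 10 + (b - '0');
    }
    return value;
  }

  /** Builds an in-memory entity that hands out a fresh stream on every open(). */
  static ReopenableEntity entityOf(String text) {
    byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
    return () -> new ByteArrayInputStream(bytes);
  }

  public static void main(String[] args) {
    StreamParser<Long> digits = ReopenOnErrorSketch::parseDigits;
    System.out.println(parseDirectly(entityOf("12345"), digits));
    try {
      parseDirectly(entityOf("12x45"), digits);
    } catch (ParseFailure e) {
      System.out.println(e.getMessage());
    }
  }
}
```

The point of the design is that the successful path pays for a single read of the bytes, while the cost of materializing the input as a String is deferred to the (presumably rare) failure path. That only works if `open()` can be called twice on the same entity, which is why the contract change is needed.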
