gianm opened a new pull request, #15681:
URL: https://github.com/apache/druid/pull/15681
Three changes:
1) Reworked `FastLineIterator` to optionally avoid generating Strings
entirely, and to reduce copying somewhat. This benefits the
line-oriented JSON, CSV, delimited (TSV), and regex formats.
2) In the delimited (TSV) format, when the delimiter is a single byte,
split on UTF-8 bytes directly.
3) In CSV and delimited (TSV) formats, use list-based input rows when
the column list is provided upfront by the user.
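
To illustrate change (2), here is a minimal sketch of splitting a
UTF-8 line on a single-byte delimiter without first decoding the whole
line to a String. It is not the code from this patch: the class
`ByteSplitSketch` and the helpers `splitOnByte` and `decodeFields` are
hypothetical names made up for illustration. The sketch relies on a
property of UTF-8: a byte below 0x80 never occurs inside a multi-byte
sequence, so scanning raw bytes for such a delimiter cannot cut a
character in half.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ByteSplitSketch {
    // Hypothetical helper (not Druid's API): returns [start, end) byte
    // offsets of each field in line[0, length), split on one delimiter byte.
    static List<int[]> splitOnByte(byte[] line, int length, byte delimiter) {
        List<int[]> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < length; i++) {
            if (line[i] == delimiter) {
                fields.add(new int[] {start, i});
                start = i + 1;
            }
        }
        // The last field runs to the end of the line.
        fields.add(new int[] {start, length});
        return fields;
    }

    // Hypothetical helper: decodes each field separately, so a String is
    // created only per field, never for the whole line.
    static List<String> decodeFields(byte[] line, List<int[]> fields) {
        List<String> out = new ArrayList<>();
        for (int[] f : fields) {
            out.add(new String(line, f[0], f[1] - f[0], StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        // Line contains multi-byte characters and one empty field.
        byte[] line = "caf\u00e9\tna\u00efve\t\t42".getBytes(StandardCharsets.UTF_8);
        List<int[]> fields = splitOnByte(line, line.length, (byte) '\t');
        System.out.println(decodeFields(line, fields).size() + " fields");
    }
}
```

A caller that only needs some of the columns could decode just those
offsets and skip the rest, which is one way this kind of split can
reduce String creation.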
Benchmarks below. Findings:
- `JsonLineReaderBenchmark` only benefits from change (1), and got a 15%
improvement.
- `DelimitedInputFormatBenchmark` with `fromHeader: true` benefits from (1)
and (2), and got a 22% improvement.
- `DelimitedInputFormatBenchmark` with `fromHeader: false` benefits from all
three changes, and got a 30% improvement.
```
Benchmark                               (fromHeader)  Mode  Cnt     Score     Error  Units
DelimitedInputFormatBenchmark.baseline         false  avgt    5  1912.257 ±  39.227  us/op  [master]
DelimitedInputFormatBenchmark.baseline          true  avgt    5  1953.915 ±  44.787  us/op  [master]
JsonLineReaderBenchmark.baseline                      avgt    5  2055.294 ±  28.688  us/op  [master]
DelimitedInputFormatBenchmark.baseline         false  avgt    5  1321.142 ±  10.115  us/op  [patch]
DelimitedInputFormatBenchmark.baseline          true  avgt    5  1506.412 ±  15.892  us/op  [patch]
JsonLineReaderBenchmark.baseline                      avgt    5  1734.426 ±  38.518  us/op  [patch]
```
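
For reference, the improvement figures can be recomputed from the
average times in the table as `(master - patch) / master`. The sketch
below does that arithmetic; the class `ImprovementSketch` and the
method `improvementPercent` are hypothetical names used only here, and
the error margins are ignored.

```java
import java.util.Locale;

public class ImprovementSketch {
    // Relative reduction in average time per op, as a percentage.
    static double improvementPercent(double master, double patch) {
        return (master - patch) / master * 100.0;
    }

    public static void main(String[] args) {
        // Scores (us/op) copied from the benchmark table.
        System.out.printf(Locale.ROOT, "Delimited, fromHeader=false: %.1f%%%n",
            improvementPercent(1912.257, 1321.142));
        System.out.printf(Locale.ROOT, "Delimited, fromHeader=true:  %.1f%%%n",
            improvementPercent(1953.915, 1506.412));
        System.out.printf(Locale.ROOT, "JsonLineReader:              %.1f%%%n",
            improvementPercent(2055.294, 1734.426));
    }
}
```

The results come out slightly above the 30%, 22%, and 15% quoted in
the findings, which appear to be rounded down.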