gianm opened a new pull request, #12745:
URL: https://github.com/apache/druid/pull/12745
As we move towards query execution plans that involve more transfer
of data between servers, it's important to have a data format that
provides for doing this more efficiently than the options available to
us today.
This patch adds:
- Columnar frames, which support fast querying. Writes are faster than
on the segment format. Querying is slower than equivalent operations
on the segment format, due to the lack of indexes and to various choices
intended to support fast writes as well as reasonably fast reads.
Benchmarks below.
- Row-based frames, which support fast sorting via memory comparison
and fast whole-row copies via memory copying.
- Frame files, a container format that can be stored on disk or
transferred between servers.
The idea is we should use row-based frames when data is expected to
be sorted, and columnar frames when data is expected to be queried.
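To illustrate why memory comparison makes sorting cheap, here is a minimal Java sketch. It is not the frame format from this patch: the class `RowKeySortSketch`, the `(long, int)` key layout, and the sign-flip encoding are all assumptions chosen for the example. The point is only that once row keys are laid out as byte strings whose unsigned byte order matches the desired sort order, sorting needs no per-field decoding.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the Druid frame format: shows how rows can be
// sorted by comparing raw bytes once keys are encoded in a comparable form.
public class RowKeySortSketch {
    // Encode a (long, int) key as big-endian bytes with the sign bit flipped,
    // so that unsigned lexicographic byte order equals signed numeric order.
    static byte[] encodeKey(long first, int second) {
        return ByteBuffer.allocate(Long.BYTES + Integer.BYTES)
            .putLong(first ^ Long.MIN_VALUE)
            .putInt(second ^ Integer.MIN_VALUE)
            .array();
    }

    // Recover the first field by undoing the sign-bit flip.
    static long decodeFirst(byte[] key) {
        return ByteBuffer.wrap(key).getLong(0) ^ Long.MIN_VALUE;
    }

    // Recover the second field by undoing the sign-bit flip.
    static int decodeSecond(byte[] key) {
        return ByteBuffer.wrap(key).getInt(Long.BYTES) ^ Integer.MIN_VALUE;
    }

    // Sort a copy of the keys using only unsigned byte comparison (Java 9+).
    static List<byte[]> sortKeys(List<byte[]> keys) {
        List<byte[]> sorted = new ArrayList<>(keys);
        sorted.sort(Arrays::compareUnsigned);
        return sorted;
    }

    public static void main(String[] args) {
        List<byte[]> keys = List.of(
            encodeKey(5L, 1),
            encodeKey(-3L, 7),
            encodeKey(5L, -2)
        );
        for (byte[] key : sortKeys(keys)) {
            System.out.println(decodeFirst(key) + ", " + decodeSecond(key));
        }
    }
}
```

Because each encoded row is a contiguous run of bytes, the same layout also allows a whole row to be moved with a single array copy, which is the other property claimed for row-based frames above.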
The code in this patch is not used in production yet. Therefore, the
patch involves minimal changes outside of the `org.apache.druid.frame`
package. The main ones are adjustments to SqlBenchmark to add benchmarks
for queries on frames, and the addition of a "forEach" method to Sequence.
Future patches in the #12262 sequence will use these frames for data
transfer and short-term storage.
Benchmarks for queries on frames vs. traditional segments (mmap):
```
Benchmark              (query)  (rowsPerSegment)  (storageType)   (vectorize)  Mode  Cnt    Score     Error  Units
SqlBenchmark.querySql        0           2000000  mmap            false        avgt   15    6.296 ±   0.081  ms/op
SqlBenchmark.querySql        0           2000000  frame-row       false        avgt   15   88.495 ±   0.579  ms/op
SqlBenchmark.querySql        0           2000000  frame-columnar  false        avgt   15   13.715 ±   0.562  ms/op
SqlBenchmark.querySql       10           2000000  mmap            false        avgt   15  251.530 ±   4.862  ms/op
SqlBenchmark.querySql       10           2000000  frame-row       false        avgt   15  626.003 ±   4.862  ms/op
SqlBenchmark.querySql       10           2000000  frame-columnar  false        avgt   15  466.353 ±   0.603  ms/op
SqlBenchmark.querySql       18           2000000  mmap            false        avgt   15  172.775 ±   0.890  ms/op
SqlBenchmark.querySql       18           2000000  frame-row       false        avgt   15  225.835 ±   2.350  ms/op
SqlBenchmark.querySql       18           2000000  frame-columnar  false        avgt   15  177.613 ±   1.210  ms/op

Benchmark              (query)  (rowsPerSegment)  (storageType)   (vectorize)  Mode  Cnt    Score     Error  Units
SqlBenchmark.querySql        0           2000000  mmap            force        avgt   15    0.509 ±   0.013  ms/op
SqlBenchmark.querySql        0           2000000  frame-columnar  force        avgt   15    7.524 ±   0.123  ms/op
SqlBenchmark.querySql       10           2000000  mmap            force        avgt   15  174.626 ±  10.985  ms/op
SqlBenchmark.querySql       10           2000000  frame-columnar  force        avgt   15  455.922 ±  19.296  ms/op
SqlBenchmark.querySql       18           2000000  mmap            force        avgt   15   38.537 ±   1.182  ms/op
SqlBenchmark.querySql       18           2000000  frame-columnar  force        avgt   15   50.755 ±   0.751  ms/op
```