clintropolis opened a new pull request #11010:
URL: https://github.com/apache/druid/pull/11010
### Description
Expands on the structures added in #10613 to add support for grouping on
string expressions in the vectorized group by engine. The key addition that
makes this possible is
`DictionaryBuildingSingleValueStringGroupByVectorColumnSelector`, which is the
vectorized group by engine version of
`DictionaryBuildingStringGroupByColumnSelectorStrategy`, and allows the vector
group by engine to group on strings which are not dictionary encoded.
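
As a rough illustration of the dictionary-building idea (not the code in this PR), the sketch below assigns an int id to each distinct string the first time it is seen, so that the ids can serve as fixed-width grouping keys and be mapped back to strings when results are emitted. The class name `OnTheFlyDictionarySketch` and all of its methods are hypothetical and do not exist in Druid.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; this is not the Druid selector class. */
class OnTheFlyDictionarySketch
{
  // value -> id, used while reading input vectors
  private final Map<String, Integer> idLookup = new HashMap<>();
  // id -> value, used when writing results
  private final List<String> values = new ArrayList<>();

  /** Returns the id for a value, assigning the next free id on first sight. */
  int idFor(String value)
  {
    Integer existing = idLookup.get(value);
    if (existing != null) {
      return existing;
    }
    int id = values.size();
    idLookup.put(value, id);
    values.add(value);
    return id;
  }

  /** Encodes a whole vector of strings into int ids usable as grouping keys. */
  int[] encodeVector(String[] vector)
  {
    int[] ids = new int[vector.length];
    for (int i = 0; i < vector.length; i++) {
      ids[i] = idFor(vector[i]);
    }
    return ids;
  }

  /** Maps an id back to its string value. */
  String lookup(int id)
  {
    return values.get(id);
  }
}
```

Repeated values share one id, so the grouping key stays a single int per row regardless of string length.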
To help showcase this, I added vectorization support to the concat operator
`string1 + 'foo'`, and the concat function
`concat(string1,'-',string2,'-',long1)`.
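
To make "vectorized concat" concrete, here is a minimal sketch that processes a whole batch of rows per call instead of one row at a time. It is not Druid's expression vector processor API: `VectorConcatSketch` and its methods are hypothetical, and the choice to return null when a string input is null is an assumption made for the example.

```java
/** Hypothetical illustration only; not Druid's vector processor interface. */
class VectorConcatSketch
{
  /** Mirrors `string1 + 'foo'`: appends one literal to every row of the batch. */
  static String[] concatLiteral(String[] input, String literal)
  {
    String[] out = new String[input.length];
    for (int i = 0; i < input.length; i++) {
      // Assumption: a null input row yields a null output row.
      out[i] = input[i] == null ? null : input[i] + literal;
    }
    return out;
  }

  /** Mirrors `concat(string2, '-', long2)`: joins a string vector and a long vector. */
  static String[] concat(String[] strings, String separator, long[] longs)
  {
    if (strings.length != longs.length) {
      throw new IllegalArgumentException("input vectors must have the same length");
    }
    String[] out = new String[strings.length];
    for (int i = 0; i < strings.length; i++) {
      // Assumption: a null string row yields a null output row.
      out[i] = strings[i] == null ? null : strings[i] + separator + longs[i];
    }
    return out;
  }
}
```

The output is a plain string vector with no dictionary behind it, which is why a dictionary-building selector is needed to group on it.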
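Vectorizing these expressions gives a measurable speedup. In the added benchmark
queries, average time per operation drops by about 36% for query 26 (1601 ms to
1018 ms) and about 48% for query 27 (2073 ms to 1073 ms):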
```
// 26: group by string expr with non-expr agg
"SELECT CONCAT(string2, '-', long2), SUM(double1) FROM foo GROUP BY 1 ORDER BY 2",
// 27: group by string expr with expr agg
"SELECT CONCAT(string2, '-', long2), SUM(long1 * double4) FROM foo GROUP BY 1 ORDER BY 2"
```
```
Benchmark                        (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt     Score    Error  Units
SqlExpressionBenchmark.querySql       26           5000000        false  avgt    5  1601.424 ± 22.075  ms/op
SqlExpressionBenchmark.querySql       26           5000000        force  avgt    5  1017.797 ± 18.384  ms/op
SqlExpressionBenchmark.querySql       27           5000000        false  avgt    5  2072.850 ± 46.369  ms/op
SqlExpressionBenchmark.querySql       27           5000000        force  avgt    5  1072.897 ± 19.756  ms/op
```
I will save vectorizing additional string expressions for a future PR.
<hr>
##### Key changed/added classes in this PR
* `DictionaryBuildingSingleValueStringGroupByVectorColumnSelector`
* `VectorGroupByEngine`
* `GroupByVectorColumnProcessorFactory`
* `VectorStringProcessors`
* `StringOutMultiStringInVectorProcessor`
<hr>
This PR has:
- [ ] been self-reviewed.
- [ ] added documentation for new or modified features or behaviors.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added comments explaining the "why" and the intent of the code
wherever it would not be obvious to an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] been tested in a test Druid cluster.