clintropolis opened a new pull request #11010:
URL: https://github.com/apache/druid/pull/11010


   ### Description
   Expands on the structures added in #10613 to add support for grouping on string expressions in the vectorized group by engine. The key addition is `DictionaryBuildingSingleValueStringGroupByVectorColumnSelector`, the vectorized counterpart of `DictionaryBuildingStringGroupByColumnSelectorStrategy`, which allows the vector group by engine to group on strings that are not dictionary encoded.
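
As a rough illustration of what grouping on strings that are not dictionary encoded involves, the sketch below assigns integer ids to string values as they are first seen, so that a grouping key can stay a fixed-width int. `OnTheFlyDictionary` and all of its methods are hypothetical names made up for this example; they are not the classes added in this PR and do not reflect their actual structure.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Druid's actual selector implementation.
final class OnTheFlyDictionary
{
  // value -> id, used while encoding rows
  private final Map<String, Integer> idLookup = new HashMap<>();
  // id -> value, used when turning grouping keys back into strings
  private final List<String> values = new ArrayList<>();

  // Returns the existing id for a value, or assigns the next free id.
  int getOrAdd(String value)
  {
    Integer existing = idLookup.get(value);
    if (existing != null) {
      return existing;
    }
    int id = values.size();
    idLookup.put(value, id);
    values.add(value);
    return id;
  }

  // Encodes the first numRows entries of a vector of strings into ids.
  int[] encodeVector(String[] vector, int numRows)
  {
    int[] ids = new int[numRows];
    for (int i = 0; i < numRows; i++) {
      ids[i] = getOrAdd(vector[i]);
    }
    return ids;
  }

  String lookup(int id)
  {
    return values.get(id);
  }

  int size()
  {
    return values.size();
  }
}
```

In this sketch, repeated values in a vector map to the same id, and `lookup` recovers the original string when results are emitted.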
   
   To help showcase this, I added vectorization support to the concat operator 
`string1 + 'foo'`, and the concat function 
`concat(string1,'-',string2,'-',long1)`.
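
To make the idea of a vectorized concat concrete, here is a minimal sketch that processes a whole batch of rows per call instead of one row at a time. `VectorConcatSketch` and `concatVector` are hypothetical names for this example, and the null handling (a null string input yields a null output) is an assumption rather than a statement about Druid's behavior.

```java
// Hypothetical illustration only; not the processor classes added in this PR.
final class VectorConcatSketch
{
  // Concatenates left[i] + separator + right[i] for the first numRows rows.
  static String[] concatVector(String[] left, String separator, long[] right, int numRows)
  {
    String[] out = new String[numRows];
    for (int i = 0; i < numRows; i++) {
      // Assumption: a null input produces a null output for that row.
      out[i] = left[i] == null ? null : left[i] + separator + right[i];
    }
    return out;
  }
}
```

The output array holds plain strings with no dictionary ids attached, which is why a dictionary-building selector is needed before the group by engine can group on the result.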
   
   It provides a pretty decent performance increase, roughly 1.6x for query 26 and 1.9x for query 27 of the added benchmark queries:
   
   ```
         // 26: group by string expr with non-expr agg
         "SELECT CONCAT(string2, '-', long2), SUM(double1) FROM foo GROUP BY 1 ORDER BY 2",
         // 27: group by string expr with expr agg
         "SELECT CONCAT(string2, '-', long2), SUM(long1 * double4) FROM foo GROUP BY 1 ORDER BY 2"
   ```
   
   ```
   Benchmark                        (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt     Score    Error  Units
   SqlExpressionBenchmark.querySql       26           5000000        false  avgt    5  1601.424 ± 22.075  ms/op
   SqlExpressionBenchmark.querySql       26           5000000        force  avgt    5  1017.797 ± 18.384  ms/op
   SqlExpressionBenchmark.querySql       27           5000000        false  avgt    5  2072.850 ± 46.369  ms/op
   SqlExpressionBenchmark.querySql       27           5000000        force  avgt    5  1072.897 ± 19.756  ms/op
   ```
   
   I will save vectorizing additional string expressions for a future PR.
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `DictionaryBuildingSingleValueStringGroupByVectorColumnSelector`
    * `VectorGroupByEngine`
    * `GroupByVectorColumnProcessorFactory`
    * `VectorStringProcessors`
    * `StringOutMultiStringInVectorProcessor`
   
   <hr>
   
   This PR has:
   - [ ] been self-reviewed.
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] been tested in a test Druid cluster.
   

