clintropolis commented on pull request #12277: URL: https://github.com/apache/druid/pull/12277#issuecomment-1055789995
Some additional, less scientific measurements, using a 10GB file of the NYC taxi dataset with all of the columns stored as strings:

<img width="543" alt="Screen Shot 2022-02-27 at 6 21 58 PM" src="https://user-images.githubusercontent.com/1577461/156236031-1ac864f3-e382-424b-b547-0693cbd73e30.png">

Grouping performance seems competitive:

<img width="1415" alt="Screen Shot 2022-02-28 at 1 26 06 PM" src="https://user-images.githubusercontent.com/1577461/156236107-ab93a26d-0fae-49b0-bd19-17ab7eaf26ad.png">
<img width="1408" alt="Screen Shot 2022-02-28 at 1 25 51 PM" src="https://user-images.githubusercontent.com/1577461/156236156-90389487-49cb-4ded-bce1-5c5b58fd759d.png">

select * does show a performance decrease, as the earlier benchmarks suggested:

<img width="1403" alt="Screen Shot 2022-02-27 at 6 28 15 PM" src="https://user-images.githubusercontent.com/1577461/156236275-9436433a-3101-4481-988a-50c17cbc1434.png">
<img width="1406" alt="Screen Shot 2022-02-27 at 6 27 41 PM" src="https://user-images.githubusercontent.com/1577461/156236290-b604a6fb-a93b-4e9a-a228-724816536c1e.png">

I still haven't had the chance to spend any time optimizing the code, but the size savings definitely make this feel worth considering for clusters where the typical workload does not include queries that hit a large number of columns, such as "wide" scans ("select *", etc.) or group bys across many columns.
