michdom44 opened a new issue #9985: URL: https://github.com/apache/druid/issues/9985
### Affected Version Druid 0.12.1 ### Description [rec_use_agg_imsi_supervisor.txt](https://github.com/apache/druid/files/4729207/rec_use_agg_imsi_supervisor.txt) [WINPRJ_current_config_2020-06-03_10.40.53.txt](https://github.com/apache/druid/files/4729209/WINPRJ_current_config_2020-06-03_10.40.53.txt) We have a data source ingested via Kafka (it’s not real-time ingestion but only ingestion of about 1 million rows once a day) which contains approximately 23.3 million rows (see supervisor spec in attachment). Our client wishes to extract information via Druid SQL with the following query: { "query": "select visitedoperator, visitedcountry, visitedmccmnc, count (distinct imsi) as count_unique_imsi from (select imsi, visitedoperator, visitedcountry, visitedmccmnc, count (flg_isalive) as nb_DayAlive from rec_use_agg_imsi where homemccmnc = 20801 and __ = 20801 and -01 'and __time <=' 2019-12-31 'group by imsi, visitedoperator, visitedcountry, visitedmccmnc having nb_DayAlive> = 1) group by visitedoperator, visitedcountry, visitedmccmnc " } This query works perfectly and returns 20 rows but its execution time is around 30 seconds: # time curl --negotiate -u: -XPOST -H'Content-Type: application / json 'http: // $ {DRUID_BROKER_HOST}: $ {DRUID_BROKER_PORT} / druid / v2 / sql / -d @ longReq.json | jq % Total% Received% Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 548 100 106 100 442 7857 32762 -: -: - -: -: - -: -: - 34000 561 442 0 2041 100 442 73 16 0:00:27 0:00:27 -: -: - 352 For information, the embedded “select..group by” returns approximately 3.1 million rows. The Druid architecture consists of 2 masters (coordinator and overlord processes on each node, 8 vcores and 16 Gb RAM), 2 data (historical and middlemanager processes on each node, 32 vcores and 128 Gb RAM) and 2 query (broker process on each node, 8 vcores and 16 Gb RAM). Cluster is Kerberized. I would like to know if it is possible to improve the performance of this type of query and how. I have tried to scale up the Druid architecture by doubling the number of broker and historical / middlemanager nodes, but I still get the same execution time. I also followed the tuning recommendations of the Druid documentation about heap and direct memory sizing but without success. Please find in attachment our Druid cluster configuration. Thanks for help! Regards, Michel. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
