michdom44 opened a new issue #9985:
URL: https://github.com/apache/druid/issues/9985


   ### Affected Version
   
   Druid 0.12.1
   
   ### Description
   
   
[rec_use_agg_imsi_supervisor.txt](https://github.com/apache/druid/files/4729207/rec_use_agg_imsi_supervisor.txt)
   
[WINPRJ_current_config_2020-06-03_10.40.53.txt](https://github.com/apache/druid/files/4729209/WINPRJ_current_config_2020-06-03_10.40.53.txt)
   
   We have a data source ingested via Kafka (it’s not real-time ingestion but 
only ingestion of about 1 million rows once a day) which contains approximately 
23.3 million rows (see supervisor spec in attachment). Our client wishes to 
extract information via Druid SQL with the following query:
   {
     "query": "select visitedoperator, visitedcountry, visitedmccmnc, count 
(distinct imsi) as count_unique_imsi from (select imsi, visitedoperator, 
visitedcountry, visitedmccmnc, count (flg_isalive) as nb_DayAlive from 
rec_use_agg_imsi where homemccmnc = 20801 and __ = 20801 and -01 'and __time 
<=' 2019-12-31 'group by imsi, visitedoperator, visitedcountry, visitedmccmnc 
having nb_DayAlive> = 1) group by visitedoperator, visitedcountry, 
visitedmccmnc "
   }
   This query works perfectly and returns 20 rows but its execution time is 
around 30 seconds:
   
   # time curl --negotiate -u: -XPOST -H'Content-Type: application / json 
'http: // $ {DRUID_BROKER_HOST}: $ {DRUID_BROKER_PORT} / druid / v2 / sql / -d 
@ longReq.json | jq
     % Total% Received% Xferd Average Speed Time Time Time Current
                                    Dload Upload Total Spent Left Speed
   100 548 100 106 100 442 7857 32762 -: -: - -: -: - -: -: - 34000
   561 442 0 2041 100 442 73 16 0:00:27 0:00:27 -: -: - 352
   
   For information, the embedded “select..group by” returns approximately 3.1 
million rows.
   
   The Druid architecture consists of 2 masters (coordinator and overlord 
processes on each node, 8 vcores and 16 Gb RAM), 2 data (historical and 
middlemanager processes on each node, 32 vcores and 128 Gb RAM) and 2 query 
(broker process on each node, 8 vcores and 16 Gb RAM). Cluster is Kerberized.
   
   I would like to know if it is possible to improve the performance of this 
type of query and how. I have tried to scale up the Druid architecture by 
doubling the number of broker and historical / middlemanager nodes, but I still 
get the same execution time. I also followed the tuning recommendations of the 
Druid documentation about heap and direct memory sizing but without success.
   
   Please find in attachment our Druid cluster configuration.
   
   Thanks for help!
   
   Regards,
   Michel.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to