Hi, I am looking for an explanation for a workaround I found to an issue that has been bugging my team for the past few weeks.
Last June we moved from Drill 1.12 to Drill 1.19 (a long overdue upgrade), and not long after we started hitting the issue described below.

We run a query daily on about 410GB of text data spread over ~2200 files. It has a planner cost of ~22 billion, is queued as a Large query, and runs fine. When the same query runs on 200MB spread over 130 files (same data format), with a cost of ~36 million, it is also queued as a Large query and never completes. The small dataset query stops making any progress after a few minutes; left running for hours, it shows no progress and never finishes. The last running fragment is a HASH_PARTITION_SENDER showing 100% fragment time.

After much shooting in the dark (debugging sessions, analysing the source data, etc.), we reviewed our cluster configuration. When we changed exec.queue.threshold from 30000000 to 60000000, so that the 200MB dataset query is categorized as a Small query, it started to complete consistently in less than 10 seconds. The physical plan is identical whether the query is queued as Large or Small.

Is there a difference internally in how Drill executes a query depending on whether it is queued as Small or Large? Could you explain why this workaround works?

Cluster details:
- 8 drillbits running in Kubernetes
- 16GB direct memory and 4GB heap per drillbit
- data stored on a very efficient NFS, exposed as a k8s PV/PVC to the drillbit pods

Thanks for any insight you can provide on that setting or on our initial problem.

François
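
P.S. In case it helps anyone reproduce this, the workaround boils down to something like the following (Drill SQL run from sqlline or the web UI; we also inspect the queue-related options this way, the only option we actually changed is exec.queue.threshold as described above):

    -- raise the Small/Large cost cutoff so the 200MB query lands in the Small queue
    ALTER SYSTEM SET `exec.queue.threshold` = 60000000;

    -- inspect the current queue-related settings
    SELECT * FROM sys.options WHERE name LIKE 'exec.queue%';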