Hi, I am looking for an explanation for a workaround I found to an issue that has been bugging my team for the past few weeks.
Last June we moved from Drill 1.12 to Drill 1.19 (a long overdue upgrade), and not long after we started hitting the issue described below.

We run a query daily on about 410GB of text data spread over ~2200 files. It has a planner cost of ~22 billion, is queued as a Large query, and runs fine. When the same query runs on 200MB spread over 130 files (same data format), with a cost of ~36 million, it is also queued as a Large query and never completes. The small dataset query stops making any progress after a few minutes; left running for hours, it shows no progress and never finishes. The last running fragment is a HASH_PARTITION_SENDER showing 100% fragment time.

After much shooting in the dark (debugging sessions, analysing the source data, etc.), we reviewed our cluster configuration. When we changed exec.queue.threshold from 30000000 to 60000000, so that the 200MB dataset query is categorized as a Small query, it started to complete consistently in less than 10 seconds. The physical plan is identical whether the query is queued as Large or Small.

Is there a difference internally in how Drill executes a query depending on whether it is queued as Small or Large? Could you explain why this workaround works?

Cluster details:
- 8 drillbits running in Kubernetes
- 16GB direct memory and 4GB heap per drillbit
- data stored on a very efficient NFS, exposed as a k8s PV/PVC to the drillbit pods

Thanks for any insight you can provide on that setting or on our initial problem.

François
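
P.S. In case it helps anyone reproduce this, the workaround boils down to something like the following (Drill SQL run from sqlline or the web UI; we also inspect the queue-related options this way, the only option we actually changed is exec.queue.threshold as described above):

    -- raise the Small/Large cost cutoff so the 200MB query lands in the Small queue
    ALTER SYSTEM SET `exec.queue.threshold` = 60000000;

    -- inspect the current queue-related settings
    SELECT * FROM sys.options WHERE name LIKE 'exec.queue%';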