Hi Team Drill!


While running performance tests on Drill clusters on AWS EMR, with TPC-H
data at scale factor 100, I observed that the results for a 3-node cluster
are similar to those for a 13-node cluster. Hence, I am investigating how
the query is carried out and which part of the query handling (e.g.
planning, reading the data, executing the query, transferring the record
batches) dominates the running time.



Parth kindly suggested that I use the Query Profile from the Web UI, and
it helped a lot. However, there are some items on the Query Profile page
for which I could not find any documentation. I was wondering if you
could tell me what the following items mean:



*I)                    What do the operator types Project, Unordered
Receiver and Single Sender mean? I guess Hive Sub Scan is the time spent
reading data from Hive, is that correct?*

*II)                  What are the units for the Process columns in the
Operator Profiles Overview table? Is it time in minutes:seconds format?*



Also it would be really nice to know:

III)                What metric does the blue chart at the top of the
Overview section show?

IV)               What is a fragment, a minor fragment and a major fragment?

V)                 What do first start, last start, first end and last end
mean?

VI)               What are the sets over which max, min and average are
calculated?

VII)             Why is the peak memory so small? It is 4 MB, while the
machine has 16 GB of RAM.



A screen capture of the Web UI and the JSON profile are attached.

Thanks a lot for your time and help.



Thanks,

Alex

Attachment: query profile.json
Description: application/json
