Thanks very much, Andries, for the detailed information answering my
questions. I really appreciate it.

The tables are stored in HDFS on the EMR cluster, not on S3, and are then
loaded into Hive as external tables.


Thanks,
Alex
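For anyone digging into the profile fields discussed in the quoted thread below (major/minor fragments, first/last start and end, min/max/avg, peak memory), the same numbers can be pulled out of the attached JSON profile programmatically. Here is a minimal sketch; the field names ("fragmentProfile", "minorFragmentProfile", "startTime", "endTime") are assumed from a typical Drill query-profile JSON and may need adjusting for your version:

```python
import json  # used to load the attached "query profile.json" (see usage below)

def fragment_summary(profile):
    """Summarize per-major-fragment timing across its minor fragments.

    `profile` is a dict parsed from a Drill query-profile JSON file.
    Field names are assumptions from a typical profile layout; times are
    epoch milliseconds.
    """
    summary = {}
    for major in profile.get("fragmentProfile", []):
        minors = major.get("minorFragmentProfile", [])
        if not minors:
            continue
        starts = [m["startTime"] for m in minors]
        ends = [m["endTime"] for m in minors]
        durations = [e - s for s, e in zip(starts, ends)]
        summary[major["majorFragmentId"]] = {
            "threads": len(minors),          # degree of parallelization
            "first_start": min(starts),      # when the first thread began
            "last_start": max(starts),       # big gap to first_start = slow fan-out
            "first_end": min(ends),
            "last_end": max(ends),           # a straggler pushes this out
            "min_ms": min(durations),
            "max_ms": max(durations),
            "avg_ms": sum(durations) / len(durations),
        }
    return summary

# Usage with the attached profile (hypothetical filename from the thread):
#   with open("query profile.json") as f:
#       for mid, s in sorted(fragment_summary(json.load(f)).items()):
#           print(mid, s)
```

Comparing `last_end - first_end` or `max_ms` against `avg_ms` per major fragment is one way to spot the straggler threads Andries mentions.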



On Tue, Apr 21, 2015 at 3:08 PM, Andries Engelbrecht <
[email protected]> wrote:

> Alex,
>
> Definitely looks like the majority of the time is by far spent reading the
> Hive data (Hive_Sub_Scan). I'm not sure how well the storage environment is
> configured, and it may very likely be that the nodes are just waiting on
> storage IO. More nodes will simply wait longer to actually get the data
> through the same pipe. Is the data on S3 with a Hive external table or
> similar?
>
> Very short answers; these will probably need much more detailed
> documentation than a mailing list can cover:
> i) We'll probably need some documentation to clarify that better.
> ii) Typically seconds are x.xxx and minutes are x:xx
> iii) Graphical representation of fragments and time taken in seconds
> iv) The best way for me to describe it: a major fragment is a query phase
> (operator), and minor fragments are the parallelization of a major
> fragment; that is why some major fragments have only a single minor
> fragment, where it can’t or doesn’t make sense to parallelize.
> v) Since you have parallel execution of major fragments, it shows when the
> first execution thread started, when the last one started, and similarly
> for ends. Basically just to give you an indication of whether one or more
> threads were significantly slower than the others (important to understand
> in some environments, to spot a straggler that holds up the whole process).
> vi) Same logic as v, to see the average execution time across all threads
> and the quickest and slowest executors.
> vii) The mem indicates how much memory was assigned to a thread to
> complete the operation. A simple scan operation doesn’t require much
> memory, vs. joins and other functions that can require much more.
>
> —Andries
>
>
>
> On Apr 21, 2015, at 2:34 PM, Alexander Zarei <[email protected]>
> wrote:
>
> Hi Team Drill!
>
>
> While running performance tests on Drill clusters on AWS EMR, with TPC-H
> data at scale factor 100, I observed that the results for a cluster of 3
> nodes are similar to those for a cluster of 13 nodes. Hence, I am
> investigating how the query is carried out and which part of the query
> handling (e.g. planning, reading the data, executing the query,
> transferring the record batches) is the dominant time-consuming part.
>
>
> Parth kindly suggested I use the Query Profile from the Web UI, and it
> helped a lot. However, there are some items on the Query Profile page that
> I could not find documentation to interpret. I was wondering if you know
> what the following items are:
>
>
> *I) What is the meaning of the operator types Project, Unordered Receiver,
> and Single Sender? I guess Hive Sub Scan is the time spent reading data
> from Hive; is that correct?*
>
> *II) What are the units for the Processes columns in the Operator Profiles
> Overview table? Is it time in a minutes:seconds format?*
>
>
>
> Also it would be really nice to know:
>
> III) What metric does the blue chart at the top of the Overview section
> present?
>
> IV) What are a fragment, a minor fragment, and a major fragment?
>
> V) What are the first start, last start, first end, and last end?
>
> VI) What are the sets over which max, min, and average are calculated?
>
> VII) Why is the peak memory so small? It is 4 MB while the machine has 16
> GB of RAM.
>
>
> The print of the Web UI as well as the json profile are attached.
>
> Thanks a lot for your time and help.
>
>
> Thanks,
>
> Alex
> <Apache Drill Query Profile.pdf><query profile.json>
>
>
>
