[ 
https://issues.apache.org/jira/browse/DRILL-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043607#comment-16043607
 ] 

Paul Rogers edited comment on DRILL-5528 at 6/9/17 12:42 AM:
-------------------------------------------------------------

Preliminary results suggest that Drill is severely CPU bound for this query. As 
a baseline, I ran the query with the original, "unmanaged" sort.

Running on a Mac, with 3 GB given to the sort, the thread running the query is 
at 100% CPU. But disk I/O during the read-sort-and-spill phase is very small: 
only about 40 MB/s for reads and 20 MB/s for writes at peak, with many valleys 
in between -- on an SSD that can do 700 MB/s. Earlier tests got upwards of 
70 MB/s.

Later, during the merge phase, read performance jumps to 150 MB/s. CPU is still 
around 95%.

Total run time was just over an hour for a file that took only a few minutes 
in previous test variations.

This suggests that, somehow, something is using far too much CPU to do the work 
of the sort. It also suggests that the excessive CPU use occurs whether the 
sort memory is "managed" or not.

Time summary:

{code}
Results: 1 records, 2 batches, 3,608,140 ms

00 Screen
  Setup:        0 ms -   0%,   0%
  Process:      4 ms -   0%,   0%
01 Project
  Setup:        5 ms -   0%,   0%
  Process:      1 ms -   0%,   0%
02 StreamAgg (STREAMING AGGREGATE)
  Setup:      133 ms -  11%,   0%
  Process:      4 ms -   0%,   0%
03 Project
  Setup:       53 ms -   4%,   0%
  Process:      0 ms -   0%,   0%
04 SelectionVectorRemover (SELECTION VECTOR REMOVER)
  Setup:       51 ms -   4%,   0%
  Process:    501 ms -   0%,   0%
05 Filter
  Setup:      194 ms -  17%,   0%
  Process: 14,910 ms -   0%,   0%
06 Project
  Setup:       41 ms -   3%,   0%
  Process:  1,414 ms -   0%,   0%
07 SelectionVectorRemover (SELECTION VECTOR REMOVER)
  Setup:        6 ms -   0%,   0%
  Process: 33,020 ms -   1%,   1%
08 Sort (EXTERNAL SORT)
  Setup:        0 ms -   0%,   0%
  Process: 1,996,522 ms -  87%,  87%
09 Project
  Setup:      137 ms -  12%,   0%
  Process: 24,847 ms -   1%,   1%
10 Project
  Setup:      499 ms -  44%,   0%
  Process: 45,571 ms -   2%,   2%
11 Scan (TEXT SUB SCAN)
  Setup:        0 ms -   0%,   0%
  Process: 153,725 ms -   6%,   6%
Total:
  Setup:    1,119 ms
  Process: 2,270,519 ms
Input batches: 0, spills: 58, merge/spills: 0
{code}
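
As a sanity check on the figures above: each operator's percentage appears to 
be its process (or setup) time divided by the corresponding total, truncated. 
A quick sketch of the arithmetic (plain Java, not Drill code; numbers copied 
from the summary):

{code}
// Re-deriving the per-operator percentages in the summary above.
public class ProfileShare {
    public static void main(String[] args) {
        long totalProcessMs = 2_270_519; // "Total: Process" from the summary
        // 08 Sort: 1,996,522 / 2,270,519 = ~87.9%, reported as 87%
        System.out.printf("Sort: %.1f%%%n", 100.0 * 1_996_522 / totalProcessMs);
        // 11 Scan: 153,725 / 2,270,519 = ~6.8%, reported as 6%
        System.out.printf("Scan: %.1f%%%n", 100.0 * 153_725 / totalProcessMs);
    }
}
{code}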

In summary form:

{code}
08 Sort (EXTERNAL SORT)
  Process: 1,996,522 ms -  87%,  87%
10 Project
  Process: 45,571 ms -   2%,   2%
11 Scan (TEXT SUB SCAN)
  Process: 153,725 ms -   6%,   6%
{code}

The sort itself takes 87% of the time. Reading the original 18 GB of data takes 
just 6% of the time. The read rate during the scan is about 18 GB / 154 s = 116 
MB/s, which is pretty good. So, clearly, the delay is somewhere in the sort 
itself.

For a baseline, I ran a query that simply reads and discards the data:

{code}
select * from (select * from dfs.data.`250wide.tbl`) d
where d.columns[0] = 'ljdfhwuehnoiueyf'
{code}

It ran in just 178 seconds, which confirms that the scan part of the query is 
fine. The read rate starts at 200 MB/s, then drops to 100 MB/s. CPU is at 97%, 
suggesting that the query, even for a simple read, is CPU bound rather than 
disk bound. Scan time is 139 seconds, or about a 130 MB/s read rate.
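
Both read-rate figures are simple unit conversions over the 18 GB input; a 
small sketch for anyone re-deriving them (decimal megabytes; plain Java, not 
Drill code):

{code}
// Re-deriving the quoted scan rates: 18 GB of input over elapsed seconds.
public class ScanRate {
    static double mbPerSec(double gigabytes, double seconds) {
        return gigabytes * 1_000.0 / seconds; // GB -> MB over elapsed time
    }
    public static void main(String[] args) {
        System.out.println(mbPerSec(18, 154)); // ~116.9 MB/s: sorted-query scan
        System.out.println(mbPerSec(18, 139)); // ~129.5 MB/s: read-and-discard scan
    }
}
{code}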

Repeating the original query with the "managed" sort version shows an 
interesting pattern. The first few minutes are identical to the simple case: a 
200 MB/s, then 100 MB/s read rate. Once spilling kicks in, the read rate drops 
to 20 MB/s and the write rate hovers around 30 MB/s. CPU stays pegged at 100%. 
So, managing memory better didn't magically solve the CPU-bound issue, 
unfortunately.

One improvement: during the merge phase, the read rate rises to 120-180 MB/s, 
far faster than the old version. This is likely due mostly to the buffering 
improvements in the spill file read path in the new version.
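
To picture that change: the idea is simply to wrap the spill-file input in a 
buffered stream so merge-phase reads come off the SSD in large sequential 
chunks rather than one small batch read at a time. A minimal sketch of the 
pattern (class, method, and buffer-size names are illustrative, not Drill's 
actual spill classes):

{code}
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration of the buffered spill-read pattern; not Drill code.
public class SpillReader {
    // Large buffer so each disk read is one big sequential transfer.
    private static final int SPILL_READ_BUFFER = 1 << 20; // 1 MB, illustrative

    static InputStream openSpillFile(String path) throws IOException {
        // Without the BufferedInputStream wrapper, every small batch read
        // becomes its own syscall and seek; with it, reads are coalesced.
        return new BufferedInputStream(new FileInputStream(path),
                                       SPILL_READ_BUFFER);
    }
}
{code}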

Results with the managed sort:

{code}
Results: 1 records, 2 batches, 1,956,403 ms
...
07 SelectionVectorRemover (SELECTION VECTOR REMOVER)
  Process:  7,919 ms -   0%,   0%
08 Sort (EXTERNAL SORT)
  Process: 1,676,093 ms -  86%,  86%
09 Project
  Process: 24,957 ms -   1%,   1%
10 Project
  Process: 46,844 ms -   2%,   2%
11 Scan (TEXT SUB SCAN)
  Process: 164,290 ms -   8%,   8%
Total:
  Setup:      652 ms
  Process: 1,934,765 ms
Input batches: 0, spills: 155, merge/spills: 4
{code}

Performance improved from about 3,600 seconds to about 1,950 seconds; the new 
time is 54% (about half) of the original.

There is still room for improvement, though: the new version spilled more often 
than the old one (155 spills vs. 58).


> Sorting 19GB data with 14GB memory in a single fragment takes ~150 minutes
> --------------------------------------------------------------------------
>
>                 Key: DRILL-5528
>                 URL: https://issues.apache.org/jira/browse/DRILL-5528
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>    Affects Versions: 1.10.0
>            Reporter: Rahul Challapalli
>            Assignee: Paul Rogers
>         Attachments: 26e2183f-2c5d-4044-0e23-4aeb45c40be5.sys.drill, 
> drillbit.log, drillbit.out, drill-env.sh
>
>
> Configuration :
> {code}
> git.commit.id.abbrev=1e0a14c
> DRILL_MAX_DIRECT_MEMORY="32G"
> DRILL_MAX_HEAP="4G"
> {code}
> Based on the runtime of the below query, I suspect there is a performance 
> bottleneck somewhere
> {code}
> [root@qa-node190 external-sort]# /opt/drill/bin/sqlline -u 
> jdbc:drill:zk=10.10.100.190:5181
> apache drill 1.11.0-SNAPSHOT
> "start your sql engine"
> 0: jdbc:drill:zk=10.10.100.190:5181> ALTER SESSION SET 
> `exec.sort.disable_managed` = false;
> +-------+-------------------------------------+
> |  ok   |               summary               |
> +-------+-------------------------------------+
> | true  | exec.sort.disable_managed updated.  |
> +-------+-------------------------------------+
> 1 row selected (0.975 seconds)
> 0: jdbc:drill:zk=10.10.100.190:5181> alter session set 
> `planner.width.max_per_node` = 1;
> +-------+--------------------------------------+
> |  ok   |               summary                |
> +-------+--------------------------------------+
> | true  | planner.width.max_per_node updated.  |
> +-------+--------------------------------------+
> 1 row selected (0.371 seconds)
> 0: jdbc:drill:zk=10.10.100.190:5181> alter session set 
> `planner.disable_exchanges` = true;
> +-------+-------------------------------------+
> |  ok   |               summary               |
> +-------+-------------------------------------+
> | true  | planner.disable_exchanges updated.  |
> +-------+-------------------------------------+
> 1 row selected (0.292 seconds)
> 0: jdbc:drill:zk=10.10.100.190:5181> alter session set 
> `planner.memory.max_query_memory_per_node` = 14106127360;
> +-------+----------------------------------------------------+
> |  ok   |                      summary                       |
> +-------+----------------------------------------------------+
> | true  | planner.memory.max_query_memory_per_node updated.  |
> +-------+----------------------------------------------------+
> 1 row selected (0.316 seconds)
> 0: jdbc:drill:zk=10.10.100.190:5181> select count(*) from (select * from 
> dfs.`/drill/testdata/resource-manager/250wide.tbl` order by columns[0])d 
> where d.columns[0] = 'ljdfhwuehnoiueyf';
> +---------+
> | EXPR$0  |
> +---------+
> | 0       |
> +---------+
> 1 row selected (8530.719 seconds)
> {code}
> I attached the logs and profile files. The data is too large to attach to a 
> jira. Reach out to me if you need any more information.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
