[jira] [Updated] (DRILL-7720) Issue observed in performance of UNION ALL between Parquet and DB query

Sreeparna Bhabani (Jira) Mon, 27 Apr 2020 02:31:42 -0700


     [ https://issues.apache.org/jira/browse/DRILL-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Sreeparna Bhabani updated DRILL-7720:
-------------------------------------
    Description: 
Consider the scenarios below. The first two perform as expected, but the 
third, a UNION ALL across two different types of datasets (Parquet + DB), 
does not.

*Scenario 1 - Parquet UNION ALL Parquet*

Individual execution time of 1st query - 5 secs

Individual execution time of 2nd query - 5 secs

UNION ALL of both queries execution time - 10 secs

*Scenario 2 - DB query UNION ALL DB query*

Individual execution time of 1st query - 5 secs

Individual execution time of 2nd query - 5 secs

UNION ALL of both queries execution time - 10 secs

*Scenario 3 - Parquet UNION ALL DB query*

Individual execution time of 1st query - 5 secs

Individual execution time of 2nd query - 1 sec

UNION ALL execution time - 20 secs

Ideally the UNION ALL execution time should not be more than 6 secs (5 + 1).

 

Config-

Heap memory - 16 GB

DRILL_MAX_DIRECT_MEMORY - 32 GB

2 Drillbits

 

Observation-

The query is distributed across both nodes when we execute either query 
individually, or a UNION ALL between datasets of the same type. However, the 
query runs on only one node when we execute a UNION ALL between two different 
types of datasets (e.g. Parquet UNION ALL DB). The UNION ALL query is not 
being parallelized, i.e. it is not split into multiple 'Minor Fragments'.
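
For anyone trying to reproduce this, one way to inspect how the planner treats the mixed query is to request its plan through the Drill REST API (`POST /query.json`). The sketch below only builds that request. The host, the storage plugin names (`dfs`, `rdbms`), the paths, and the column names are all hypothetical placeholders, not the actual queries from this report.

```python
import json
import urllib.request


def explain_statement(sql: str) -> str:
    """Wrap a query in EXPLAIN PLAN FOR so Drill returns the plan instead of rows."""
    return "EXPLAIN PLAN FOR " + sql.strip().rstrip(";")


def build_request(base_url: str, sql: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the Drill REST endpoint /query.json."""
    body = json.dumps({"queryType": "SQL", "query": explain_statement(sql)})
    return urllib.request.Request(
        base_url.rstrip("/") + "/query.json",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical query of the same shape as Scenario 3: a filtered Parquet scan
# UNION ALL a filtered DB scan. Plugin, path, and column names are assumptions.
UNION_SQL = """
SELECT id, amount FROM dfs.`/data/sales.parquet` WHERE amount > 100
UNION ALL
SELECT id, amount FROM rdbms.sales.orders WHERE amount > 100
"""

request = build_request("http://localhost:8047", UNION_SQL)
print(request.full_url)
print(request.data.decode("utf-8"))

# To actually send it against a running Drillbit:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Comparing the plan for the mixed query with the plans for the two same-type UNION ALL queries, together with the per-fragment view in the query profile, should show where the parallelization differs.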

 

Storage-

Storage is HDFS.

 

Parquet file size - 849 MB

 

Nature of query-

Both the Parquet and DB queries have some filter criteria. Neither has a sort 
or join.

 

Time taken-
|Query|SCAN|Total|
|Parquet|2.018 sec|5.419 sec|
|DB|0.146 sec|0.257 sec|
|Parquet UNION ALL DB|15.632 sec|20.729 sec|
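
To see where the extra time goes, the figures in the table above can be compared against the sum of the two individual queries. This is a rough sketch: it assumes the SCAN and Total values are directly comparable across the three runs, and it uses the serial sum as the baseline, which is the same expectation stated for Scenario 3.

```python
# (scan_seconds, total_seconds) copied from the "Time taken" table.
timings = {
    "Parquet": (2.018, 5.419),
    "DB": (0.146, 0.257),
    "Parquet UNION ALL DB": (15.632, 20.729),
}


def extra_time(timings, combined="Parquet UNION ALL DB"):
    """Return (extra_scan, extra_total): the combined query's time minus the
    sum of the individual queries' times."""
    parts = [value for name, value in timings.items() if name != combined]
    scan_sum = sum(scan for scan, _ in parts)
    total_sum = sum(total for _, total in parts)
    combined_scan, combined_total = timings[combined]
    return combined_scan - scan_sum, combined_total - total_sum


extra_scan, extra_total = extra_time(timings)
scan_share = extra_scan / extra_total

print(f"extra scan time:  {extra_scan:.3f} sec")
print(f"extra total time: {extra_total:.3f} sec")
print(f"share of the extra time spent in SCAN: {scan_share:.1%}")
```

Under these assumptions, roughly 13.5 of the 15.1 extra seconds fall in the SCAN step, which is consistent with the scan not being parallelized in the mixed query.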

  was:
Consider the scenarios below. The first two perform as expected, but the 
third, a UNION ALL across two different types of datasets (Parquet + DB), 
does not.

*Scenario 1 - Parquet UNION ALL Parquet*

Individual execution time of 1st query - 5 secs

Individual execution time of 2nd query - 5 secs

UNION ALL of both queries execution time - 10 secs

*Scenario 2 - DB query UNION ALL DB query*

Individual execution time of 1st query - 5 secs

Individual execution time of 2nd query - 5 secs

UNION ALL of both queries execution time - 10 secs

*Scenario 3 - Parquet UNION ALL DB query*

Individual execution time of 1st query - 5 secs

Individual execution time of 2nd query - 1 sec

UNION ALL execution time - 20 secs

Ideally the UNION ALL execution time should not be more than 6 secs (5 + 1).

Config-

Heap memory - 16 GB

DRILL_MAX_DIRECT_MEMORY - 32 GB

2 Drillbits

Observation-

The query is distributed across both nodes when we execute either query 
individually, or a UNION ALL between datasets of the same type. However, the 
query runs on only one node when we execute a UNION ALL between two different 
types of datasets (e.g. Parquet UNION ALL DB). The UNION ALL query is not 
being parallelized, i.e. it is not split into multiple 'Minor Fragments'.

Nature of query-

Both the Parquet and DB queries have some filter criteria. Neither has a sort 
or join.
|Query|SCAN|Total|
|Parquet|2.018 sec|5.419 sec|
|DB|0.146 sec|0.257 sec|
|Parquet UNION ALL DB|15.632 sec|20.729 sec|


> Issue observed in performance of UNION ALL between Parquet and DB query
> -----------------------------------------------------------------------
>
>                 Key: DRILL-7720
>                 URL: https://issues.apache.org/jira/browse/DRILL-7720
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.17.0
>            Reporter: Sreeparna Bhabani
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
