[
https://issues.apache.org/jira/browse/DRILL-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sreeparna Bhabani updated DRILL-7720:
-------------------------------------
Description:
{color:#26282a}Consider the below scenarios. The first 2 scenarios are giving
expected results in terms of performance. But we are not getting expected
performance for 3rd scenario which is UNION ALL with 2 different types of
datasets (Parquet + DB).{color}
{color:#26282a} {color}
*{color:#26282a}Scenario 1- Parquet UNION ALL Parquet{color}*
{color:#26282a}Individual execution time of 1st query - 5 secs{color}
{color:#26282a}Individual execution time of 2nd query - 5 secs{color}
{color:#26282a}UNION ALL of both queries execution time - 10 secs{color}
{color:#26282a} {color}
*{color:#26282a}Scenario 2 - DB query UNION ALL DB{color}*{color:#26282a}
*query*{color}
{color:#26282a}Individual execution time of 1st query - 5 secs{color}
{color:#26282a}Individual execution time of 2nd query - 5 secs{color}
{color:#26282a}UNION ALL of both queries execution time - 10 secs{color}
{color:#26282a} {color}
*{color:#26282a}Scenario 3 - Parquet UNION ALL DB query{color}*
{color:#26282a}Individual execution time of 1st query - 5 secs{color}
{color:#26282a}Individual execution time of 2nd query - 1 sec{color}
{color:#26282a}UNION ALL execution time - 20 secs{color}
{color:#26282a}Ideally the execution time should not be more than 6 secs.{color}
{color:#26282a}Config-{color}
{color:#26282a}HEAP memory - 16 GB{color}
{color:#26282a}DRILL_MAX_DIRECT_MEMORY{color} - 32 GB
2 Drillbits
Observation-
Observed that the query is distributed in 2 NODES when we are executing
individual query or executing UNION ALL between same type datasets. But query
is executing only on 1 NODE when we are executing UNION ALL between 2 types
datasets (like Parquet UNION ALL DB). The Union query is not being parallelized
i.e. split into multiple 'Minor Fragments'
Storage-
Storage is HDFS.
Parquet file size - 849 MB
Nature of query-
Both Parquet and DB query have some filter criteria. Those doesn't have sort or
join.
Time taken-
|SCAN|Total|
|Parquet|2.018s|5.419 sec|
|DB|0.146s|0.257 sec|
|Parquet UNION ALL DB|15.632s|20.729 sec|
The query we used-
SELECT column1, column2,...column25
FROM dfs.`root`.`Parquet`
WHERE column1 = <> AND column2 = <> AND column3 = <>
UNION ALL
SELECT column1, column2,...column25
FROM db.`Database`
WHERE column1 = <> AND column2 = <> AND column3 = <>
Below is the Plan after redacted names-
00-00 Screen
00-01 Project(Column1=[$0],...,Column25=[$24])
00-02 UnionExchange
01-01 UnionAll(all=[true])
01-03 Project (Column1=[$0],...,Column25=[$24])
01-05 SelectionVectorRemover
01-06 Filter(condition=[AND(=$0,'<value>',),=($1,'<value>',=($2,'<value>'))])
01-07 Scan(table=[[]], groupscan=[ParquetGroupScan], selectionroot=<>,
numFiles=13, numRowGroups=69, usedMetadatafile=true, usedMetastore=false,
filter=booleanAnd(..), cacheFileRoot=.., columns=[.....] )
01-02 Project(Column1=[$0],...,Column25=[$24])
01-04 Jdbc()
json-
{
"head" : {
"version":1,
"generator" : \{"type":"ExplainHandler","info":""},
"type" : "APACHE_DRILL_PHYSICAL","options" :
[],"queue":0,"hasResourcePlan":false,"resultMode":"EXEC"
},
"graph":[
{"pop":"jdbc-scan"},
{"pop":"project","initialAllocation""
1000000,"maxAllocation":10000000000,"cost":\{memoryCost}:4194304,"outputRowCount":1},
{"pop":"parquet-scan",
"cost":\{"memoryCost":4194304,"outputRowCount":1.2287038E7} },
{"pop":"filter" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75}
},
{"pop":"selection-vector-remover" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":\{memoryCost}:4194304,"outputRowCount":41468.75
} ,
{"pop":"project" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75}
}
{"pop":"union-all" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75}
} ,
{"pop":"union-exchange" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75
} } ,
{"pop":"project" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75},"outputProj":true
},
{"pop":"screen" ,"initialAllocation""
1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75}
}
]
}
was:
{color:#26282a}Consider the below scenarios. The first 2 scenarios are giving
expected results in terms of performance. But we are not getting expected
performance for 3rd scenario which is UNION ALL with 2 different types of
datasets (Parquet + DB).{color}
{color:#26282a} {color}
*{color:#26282a}Scenario 1- Parquet UNION ALL Parquet{color}*
{color:#26282a}Individual execution time of 1st query - 5 secs{color}
{color:#26282a}Individual execution time of 2nd query - 5 secs{color}
{color:#26282a}UNION ALL of both queries execution time - 10 secs{color}
{color:#26282a} {color}
*{color:#26282a}Scenario 2 - DB query UNION ALL DB{color}*{color:#26282a}
*query*{color}
{color:#26282a}Individual execution time of 1st query - 5 secs{color}
{color:#26282a}Individual execution time of 2nd query - 5 secs{color}
{color:#26282a}UNION ALL of both queries execution time - 10 secs{color}
{color:#26282a} {color}
*{color:#26282a}Scenario 3 - Parquet UNION ALL DB query{color}*
{color:#26282a}Individual execution time of 1st query - 5 secs{color}
{color:#26282a}Individual execution time of 2nd query - 1 sec{color}
{color:#26282a}UNION ALL execution time - 20 secs{color}
{color:#26282a}Ideally the execution time should not be more than 6 secs.{color}
{color:#26282a}Config-{color}
{color:#26282a}HEAP memory - 16 GB{color}
{color:#26282a}DRILL_MAX_DIRECT_MEMORY{color} - 32 GB
2 Drillbits
Observation-
Observed that the query is distributed in 2 NODES when we are executing
individual query or executing UNION ALL between same type datasets. But query
is executing only on 1 NODE when we are executing UNION ALL between 2 types
datasets (like Parquet UNION ALL DB). The Union query is not being parallelized
i.e. split into multiple 'Minor Fragments'
Storage-
Storage is HDFS.
Parquet file size - 849 MB
Nature of query-
Both Parquet and DB query have some filter criteria. Those doesn't have sort or
join.
Time taken-
| |SCAN|Total|
|Parquet|2.018s|5.419 sec|
|DB|0.146s|0.257 sec|
|Parquet UNION ALL DB|15.632s|20.729 sec|
> Issue observed in performance of UNION ALL between Parquet and DB query
> -----------------------------------------------------------------------
>
> Key: DRILL-7720
> URL: https://issues.apache.org/jira/browse/DRILL-7720
> Project: Apache Drill
> Issue Type: Bug
> Components: Functions - Drill
> Affects Versions: 1.17.0
> Reporter: Sreeparna Bhabani
> Priority: Major
>
> {color:#26282a}Consider the below scenarios. The first 2 scenarios are giving
> expected results in terms of performance. But we are not getting expected
> performance for 3rd scenario which is UNION ALL with 2 different types of
> datasets (Parquet + DB).{color}
> {color:#26282a} {color}
> *{color:#26282a}Scenario 1- Parquet UNION ALL Parquet{color}*
> {color:#26282a}Individual execution time of 1st query - 5 secs{color}
> {color:#26282a}Individual execution time of 2nd query - 5 secs{color}
> {color:#26282a}UNION ALL of both queries execution time - 10 secs{color}
> {color:#26282a} {color}
> *{color:#26282a}Scenario 2 - DB query UNION ALL DB{color}*{color:#26282a}
> *query*{color}
> {color:#26282a}Individual execution time of 1st query - 5 secs{color}
> {color:#26282a}Individual execution time of 2nd query - 5 secs{color}
> {color:#26282a}UNION ALL of both queries execution time - 10 secs{color}
> {color:#26282a} {color}
> *{color:#26282a}Scenario 3 - Parquet UNION ALL DB query{color}*
> {color:#26282a}Individual execution time of 1st query - 5 secs{color}
> {color:#26282a}Individual execution time of 2nd query - 1 sec{color}
> {color:#26282a}UNION ALL execution time - 20 secs{color}
> {color:#26282a}Ideally the execution time should not be more than 6
> secs.{color}
>
> {color:#26282a}Config-{color}
> {color:#26282a}HEAP memory - 16 GB{color}
> {color:#26282a}DRILL_MAX_DIRECT_MEMORY{color} - 32 GB
> 2 Drillbits
>
> Observation-
> Observed that the query is distributed in 2 NODES when we are executing
> individual query or executing UNION ALL between same type datasets. But query
> is executing only on 1 NODE when we are executing UNION ALL between 2 types
> datasets (like Parquet UNION ALL DB). The Union query is not being
> parallelized i.e. split into multiple 'Minor Fragments'
>
> Storage-
> Storage is HDFS.
>
> Parquet file size - 849 MB
>
> Nature of query-
> Both Parquet and DB query have some filter criteria. Those doesn't have sort
> or join.
>
> Time taken-
> |SCAN|Total|
> |Parquet|2.018s|5.419 sec|
> |DB|0.146s|0.257 sec|
> |Parquet UNION ALL DB|15.632s|20.729 sec|
>
> The query we used-
> SELECT column1, column2,...column25
> FROM dfs.`root`.`Parquet`
> WHERE column1 = <> AND column2 = <> AND column3 = <>
> UNION ALL
> SELECT column1, column2,...column25
> FROM db.`Database`
> WHERE column1 = <> AND column2 = <> AND column3 = <>
>
> Below is the Plan after redacted names-
> 00-00 Screen
> 00-01 Project(Column1=[$0],...,Column25=[$24])
> 00-02 UnionExchange
> 01-01 UnionAll(all=[true])
> 01-03 Project (Column1=[$0],...,Column25=[$24])
> 01-05 SelectionVectorRemover
> 01-06 Filter(condition=[AND(=$0,'<value>',),=($1,'<value>',=($2,'<value>'))])
> 01-07 Scan(table=[[]], groupscan=[ParquetGroupScan], selectionroot=<>,
> numFiles=13, numRowGroups=69, usedMetadatafile=true, usedMetastore=false,
> filter=booleanAnd(..), cacheFileRoot=.., columns=[.....] )
> 01-02 Project(Column1=[$0],...,Column25=[$24])
> 01-04 Jdbc()
>
> json-
> {
> "head" : {
> "version":1,
> "generator" : \{"type":"ExplainHandler","info":""},
> "type" : "APACHE_DRILL_PHYSICAL","options" :
> [],"queue":0,"hasResourcePlan":false,"resultMode":"EXEC"
> },
> "graph":[
> {"pop":"jdbc-scan"},
> {"pop":"project","initialAllocation""
> 1000000,"maxAllocation":10000000000,"cost":\{memoryCost}:4194304,"outputRowCount":1},
> {"pop":"parquet-scan",
> "cost":\{"memoryCost":4194304,"outputRowCount":1.2287038E7} },
> {"pop":"filter" ,"initialAllocation""
> 1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75}
> },
> {"pop":"selection-vector-remover" ,"initialAllocation""
> 1000000,"maxAllocation":10000000000,"cost":\{memoryCost}:4194304,"outputRowCount":41468.75
> } ,
> {"pop":"project" ,"initialAllocation""
> 1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75}
> }
> {"pop":"union-all" ,"initialAllocation""
> 1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75}
> } ,
> {"pop":"union-exchange" ,"initialAllocation""
> 1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75
> } } ,
> {"pop":"project" ,"initialAllocation""
> 1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75},"outputProj":true
> },
> {"pop":"screen" ,"initialAllocation""
> 1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75}
> }
> ]
> }
--
This message was sent by Atlassian Jira
(v8.3.4#803005)