[ https://issues.apache.org/jira/browse/DRILL-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sreeparna Bhabani updated DRILL-7720: ------------------------------------- Description: {color:#26282a}Consider the below scenarios. The first 2 scenarios are giving expected results in terms of performance. But we are not getting expected performance for 3rd scenario which is UNION ALL with 2 different types of datasets (Parquet + DB).{color} {color:#26282a} {color} *{color:#26282a}Scenario 1- Parquet UNION ALL Parquet{color}* {color:#26282a}Individual execution time of 1st query - 5 secs{color} {color:#26282a}Individual execution time of 2nd query - 5 secs{color} {color:#26282a}UNION ALL of both queries execution time - 10 secs{color} {color:#26282a} {color} *{color:#26282a}Scenario 2 - DB query UNION ALL DB{color}*{color:#26282a} *query*{color} {color:#26282a}Individual execution time of 1st query - 5 secs{color} {color:#26282a}Individual execution time of 2nd query - 5 secs{color} {color:#26282a}UNION ALL of both queries execution time - 10 secs{color} {color:#26282a} {color} *{color:#26282a}Scenario 3 - Parquet UNION ALL DB query{color}* {color:#26282a}Individual execution time of 1st query - 5 secs{color} {color:#26282a}Individual execution time of 2nd query - 1 sec{color} {color:#26282a}UNION ALL execution time - 20 secs{color} {color:#26282a}Ideally the execution time should not be more than 6 secs.{color} {color:#26282a}Config-{color} {color:#26282a}HEAP memory - 16 GB{color} {color:#26282a}DRILL_MAX_DIRECT_MEMORY{color} - 32 GB 2 Drillbits Observation- Observed that the query is distributed in 2 NODES when we are executing individual query or executing UNION ALL between same type datasets. But query is executing only on 1 NODE when we are executing UNION ALL between 2 types datasets (like Parquet UNION ALL DB). The Union query is not being parallelized i.e. split into multiple 'Minor Fragments' Storage- Storage is HDFS. Parquet file size - 849 MB Nature of query- Both Parquet and DB query have some filter criteria. Those doesn't have sort or join. Time taken- |SCAN|Total| |Parquet|2.018s|5.419 sec| |DB|0.146s|0.257 sec| |Parquet UNION ALL DB|15.632s|20.729 sec| The query we used- SELECT column1, column2,...column25 FROM dfs.`root`.`Parquet` WHERE column1 = <> AND column2 = <> AND column3 = <> UNION ALL SELECT column1, column2,...column25 FROM db.`Database` WHERE column1 = <> AND column2 = <> AND column3 = <> Below is the Plan after redacted names- 00-00 Screen 00-01 Project(Column1=[$0],...,Column25=[$24]) 00-02 UnionExchange 01-01 UnionAll(all=[true]) 01-03 Project (Column1=[$0],...,Column25=[$24]) 01-05 SelectionVectorRemover 01-06 Filter(condition=[AND(=$0,'<value>',),=($1,'<value>',=($2,'<value>'))]) 01-07 Scan(table=[[]], groupscan=[ParquetGroupScan], selectionroot=<>, numFiles=13, numRowGroups=69, usedMetadatafile=true, usedMetastore=false, filter=booleanAnd(..), cacheFileRoot=.., columns=[.....] ) 01-02 Project(Column1=[$0],...,Column25=[$24]) 01-04 Jdbc() json- { "head" : { "version":1, "generator" : \{"type":"ExplainHandler","info":""}, "type" : "APACHE_DRILL_PHYSICAL","options" : [],"queue":0,"hasResourcePlan":false,"resultMode":"EXEC" }, "graph":[ {"pop":"jdbc-scan"}, {"pop":"project","initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":\{memoryCost}:4194304,"outputRowCount":1}, {"pop":"parquet-scan", "cost":\{"memoryCost":4194304,"outputRowCount":1.2287038E7} }, {"pop":"filter" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75} }, {"pop":"selection-vector-remover" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":\{memoryCost}:4194304,"outputRowCount":41468.75 } , {"pop":"project" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75} } {"pop":"union-all" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75} } , {"pop":"union-exchange" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75 } } , {"pop":"project" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75},"outputProj":true }, {"pop":"screen" ,"initialAllocation"" 1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75} } ] } was: {color:#26282a}Consider the below scenarios. The first 2 scenarios are giving expected results in terms of performance. But we are not getting expected performance for 3rd scenario which is UNION ALL with 2 different types of datasets (Parquet + DB).{color} {color:#26282a} {color} *{color:#26282a}Scenario 1- Parquet UNION ALL Parquet{color}* {color:#26282a}Individual execution time of 1st query - 5 secs{color} {color:#26282a}Individual execution time of 2nd query - 5 secs{color} {color:#26282a}UNION ALL of both queries execution time - 10 secs{color} {color:#26282a} {color} *{color:#26282a}Scenario 2 - DB query UNION ALL DB{color}*{color:#26282a} *query*{color} {color:#26282a}Individual execution time of 1st query - 5 secs{color} {color:#26282a}Individual execution time of 2nd query - 5 secs{color} {color:#26282a}UNION ALL of both queries execution time - 10 secs{color} {color:#26282a} {color} *{color:#26282a}Scenario 3 - Parquet UNION ALL DB query{color}* {color:#26282a}Individual execution time of 1st query - 5 secs{color} {color:#26282a}Individual execution time of 2nd query - 1 sec{color} {color:#26282a}UNION ALL execution time - 20 secs{color} {color:#26282a}Ideally the execution time should not be more than 6 secs.{color} {color:#26282a}Config-{color} {color:#26282a}HEAP memory - 16 GB{color} {color:#26282a}DRILL_MAX_DIRECT_MEMORY{color} - 32 GB 2 Drillbits Observation- Observed that the query is distributed in 2 NODES when we are executing individual query or executing UNION ALL between same type datasets. But query is executing only on 1 NODE when we are executing UNION ALL between 2 types datasets (like Parquet UNION ALL DB). The Union query is not being parallelized i.e. split into multiple 'Minor Fragments' Storage- Storage is HDFS. Parquet file size - 849 MB Nature of query- Both Parquet and DB query have some filter criteria. Those doesn't have sort or join. Time taken- | |SCAN|Total| |Parquet|2.018s|5.419 sec| |DB|0.146s|0.257 sec| |Parquet UNION ALL DB|15.632s|20.729 sec| > Issue observed in performance of UNION ALL between Parquet and DB query > ----------------------------------------------------------------------- > > Key: DRILL-7720 > URL: https://issues.apache.org/jira/browse/DRILL-7720 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill > Affects Versions: 1.17.0 > Reporter: Sreeparna Bhabani > Priority: Major > > {color:#26282a}Consider the below scenarios. The first 2 scenarios are giving > expected results in terms of performance. But we are not getting expected > performance for 3rd scenario which is UNION ALL with 2 different types of > datasets (Parquet + DB).{color} > {color:#26282a} {color} > *{color:#26282a}Scenario 1- Parquet UNION ALL Parquet{color}* > {color:#26282a}Individual execution time of 1st query - 5 secs{color} > {color:#26282a}Individual execution time of 2nd query - 5 secs{color} > {color:#26282a}UNION ALL of both queries execution time - 10 secs{color} > {color:#26282a} {color} > *{color:#26282a}Scenario 2 - DB query UNION ALL DB{color}*{color:#26282a} > *query*{color} > {color:#26282a}Individual execution time of 1st query - 5 secs{color} > {color:#26282a}Individual execution time of 2nd query - 5 secs{color} > {color:#26282a}UNION ALL of both queries execution time - 10 secs{color} > {color:#26282a} {color} > *{color:#26282a}Scenario 3 - Parquet UNION ALL DB query{color}* > {color:#26282a}Individual execution time of 1st query - 5 secs{color} > {color:#26282a}Individual execution time of 2nd query - 1 sec{color} > {color:#26282a}UNION ALL execution time - 20 secs{color} > {color:#26282a}Ideally the execution time should not be more than 6 > secs.{color} > > {color:#26282a}Config-{color} > {color:#26282a}HEAP memory - 16 GB{color} > {color:#26282a}DRILL_MAX_DIRECT_MEMORY{color} - 32 GB > 2 Drillbits > > Observation- > Observed that the query is distributed in 2 NODES when we are executing > individual query or executing UNION ALL between same type datasets. But query > is executing only on 1 NODE when we are executing UNION ALL between 2 types > datasets (like Parquet UNION ALL DB). The Union query is not being > parallelized i.e. split into multiple 'Minor Fragments' > > Storage- > Storage is HDFS. > > Parquet file size - 849 MB > > Nature of query- > Both Parquet and DB query have some filter criteria. Those doesn't have sort > or join. > > Time taken- > |SCAN|Total| > |Parquet|2.018s|5.419 sec| > |DB|0.146s|0.257 sec| > |Parquet UNION ALL DB|15.632s|20.729 sec| > > The query we used- > SELECT column1, column2,...column25 > FROM dfs.`root`.`Parquet` > WHERE column1 = <> AND column2 = <> AND column3 = <> > UNION ALL > SELECT column1, column2,...column25 > FROM db.`Database` > WHERE column1 = <> AND column2 = <> AND column3 = <> > > Below is the Plan after redacted names- > 00-00 Screen > 00-01 Project(Column1=[$0],...,Column25=[$24]) > 00-02 UnionExchange > 01-01 UnionAll(all=[true]) > 01-03 Project (Column1=[$0],...,Column25=[$24]) > 01-05 SelectionVectorRemover > 01-06 Filter(condition=[AND(=$0,'<value>',),=($1,'<value>',=($2,'<value>'))]) > 01-07 Scan(table=[[]], groupscan=[ParquetGroupScan], selectionroot=<>, > numFiles=13, numRowGroups=69, usedMetadatafile=true, usedMetastore=false, > filter=booleanAnd(..), cacheFileRoot=.., columns=[.....] ) > 01-02 Project(Column1=[$0],...,Column25=[$24]) > 01-04 Jdbc() > > json- > { > "head" : { > "version":1, > "generator" : \{"type":"ExplainHandler","info":""}, > "type" : "APACHE_DRILL_PHYSICAL","options" : > [],"queue":0,"hasResourcePlan":false,"resultMode":"EXEC" > }, > "graph":[ > {"pop":"jdbc-scan"}, > {"pop":"project","initialAllocation"" > 1000000,"maxAllocation":10000000000,"cost":\{memoryCost}:4194304,"outputRowCount":1}, > {"pop":"parquet-scan", > "cost":\{"memoryCost":4194304,"outputRowCount":1.2287038E7} }, > {"pop":"filter" ,"initialAllocation"" > 1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75} > }, > {"pop":"selection-vector-remover" ,"initialAllocation"" > 1000000,"maxAllocation":10000000000,"cost":\{memoryCost}:4194304,"outputRowCount":41468.75 > } , > {"pop":"project" ,"initialAllocation"" > 1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75} > } > {"pop":"union-all" ,"initialAllocation"" > 1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75} > } , > {"pop":"union-exchange" ,"initialAllocation"" > 1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75 > } } , > {"pop":"project" ,"initialAllocation"" > 1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75},"outputProj":true > }, > {"pop":"screen" ,"initialAllocation"" > 1000000,"maxAllocation":10000000000,"cost":\{"memoryCost":4194304,"outputRowCount":41468.75} > } > ] > } -- This message was sent by Atlassian Jira (v8.3.4#803005)