[ 
https://issues.apache.org/jira/browse/DRILL-5093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Hou updated DRILL-5093:
------------------------------
    Description: 
This query scans all the partitions because the partitions cannot be pruned.  
When metadata caching is used, the explain plan shows all the partitions, when 
it should only show the parent.

0: jdbc:drill:zk=10.10.100.186:5181/drill/rho> explain plan for select \* from 
orders_parts_metadata where int_id = -2000 or int_id = 0 or int_id = 4000 or 
int_id is null or int_id = 10000;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(*=[$0])
00-02        Project(T62¦¦*=[$0])
00-03          SelectionVectorRemover
00-04            Filter(condition=[OR(=($1, -2000), =($1, 0), =($1, 4000), IS 
NULL($1), =($1, 10000))])
00-05              Project(T62¦¦*=[$0], int_id=[$1])
00-06                Scan(groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath 
[path=/drill/testdata/filter/orders_parts_metadata/0_0_1.parquet], 
ReadEntryWithPath 
[path=/drill/testdata/filter/orders_parts_metadata/0_0_3.parquet], 
ReadEntryWithPath 
[path=/drill/testdata/filter/orders_parts_metadata/0_0_4.parquet], 
ReadEntryWithPath 
[path=/drill/testdata/filter/orders_parts_metadata/0_0_5.parquet], 
ReadEntryWithPath 
[path=/drill/testdata/filter/orders_parts_metadata/0_0_2.parquet]], 
selectionRoot=/drill/testdata/filter/orders_parts_metadata, numFiles=5, 
usedMetadataFile=true, 
cacheFileRoot=/drill/testdata/filter/orders_parts_metadata, columns=[`*`]]])


To reproduce the problem, put the attached files into a directory. Then create 
the metadata:
refresh table metadata dfs.`path_to_directory`;
For example, if you put the files in 
/drill/testdata/filter/orders_parts_metadata, then run this sql command
refresh table metadata dfs.`/drill/testdata/filter/orders_parts_metadata`;


Here is the same query with the same table without metadata caching.

0: jdbc:drill:zk=10.10.100.186:5181/drill/rho> explain plan for select \* from 
orders_parts where int_id = -2000 or int_id = 0 or int_id = 4000 or int_id is 
null or int_id = 10000;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(*=[$0])
00-02        Project(T63¦¦*=[$0])
00-03          SelectionVectorRemover
00-04            Filter(condition=[OR(=($1, -2000), =($1, 0), =($1, 4000), IS 
NULL($1), =($1, 10000))])
00-05              Project(T63¦¦*=[$0], int_id=[$1])
00-06                Scan(groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath 
[path=maprfs:///drill/testdata/filter/orders_parts]], 
selectionRoot=maprfs:/drill/testdata/filter/orders_parts, numFiles=1, 
usedMetadataFile=false, columns=[`*`]]])


  was:
This query scans all the partitions because the partitions cannot be pruned.  
When metadata caching is used, the explain plan shows all the partitions, when 
it should only show the parent.

0: jdbc:drill:zk=10.10.100.186:5181/drill/rho> explain plan for select \* from 
orders_parts_metadata;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(*=[$0])
00-02        Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=/drill/testdata/filter/orders_parts_metadata/0_0_1.parquet], 
ReadEntryWithPath 
[path=/drill/testdata/filter/orders_parts_metadata/0_0_3.parquet], 
ReadEntryWithPath 
[path=/drill/testdata/filter/orders_parts_metadata/0_0_4.parquet], 
ReadEntryWithPath 
[path=/drill/testdata/filter/orders_parts_metadata/0_0_5.parquet], 
ReadEntryWithPath 
[path=/drill/testdata/filter/orders_parts_metadata/0_0_2.parquet]], 
selectionRoot=/drill/testdata/filter/orders_parts_metadata, numFiles=5, 
usedMetadataFile=true, 
cacheFileRoot=/drill/testdata/filter/orders_parts_metadata, columns=[`*`]]])


To reproduce the problem, put the attached files into a directory. Then create 
the metadata:
refresh table metadata dfs.`path_to_directory`;
For example, if you put the files in 
/drill/testdata/filter/orders_parts_metadata, then run this sql command
refresh table metadata dfs.`/drill/testdata/filter/orders_parts_metadata`;


Here is the same query with a table that does not have metadata caching.

0: jdbc:drill:zk=10.10.100.186:5181/drill/rho> explain plan for select \* from 
orders_parts;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(*=[$0])
00-02        Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=maprfs:///drill/testdata/filter/orders_parts]], 
selectionRoot=maprfs:/drill/testdata/filter/orders_parts, numFiles=1, 
usedMetadataFile=false, columns=[`*`]]])



> Explain plan shows all partitions when query scans all partitions, and filter 
> pushdown is used with metadata caching.
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5093
>                 URL: https://issues.apache.org/jira/browse/DRILL-5093
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.9.0
>            Reporter: Robert Hou
>            Assignee: Jinfeng Ni
>         Attachments: 0_0_1.parquet, 0_0_2.parquet, 0_0_3.parquet, 
> 0_0_4.parquet, 0_0_5.parquet, drill.parquet_metadata
>
>
> This query scans all the partitions because the partitions cannot be pruned.  
> When metadata caching is used, the explain plan shows all the partitions, 
> when it should only show the parent.
> 0: jdbc:drill:zk=10.10.100.186:5181/drill/rho> explain plan for select \* 
> from orders_parts_metadata where int_id = -2000 or int_id = 0 or int_id = 
> 4000 or int_id is null or int_id = 10000;
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(*=[$0])
> 00-02        Project(T62¦¦*=[$0])
> 00-03          SelectionVectorRemover
> 00-04            Filter(condition=[OR(=($1, -2000), =($1, 0), =($1, 4000), IS 
> NULL($1), =($1, 10000))])
> 00-05              Project(T62¦¦*=[$0], int_id=[$1])
> 00-06                Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath 
> [path=/drill/testdata/filter/orders_parts_metadata/0_0_1.parquet], 
> ReadEntryWithPath 
> [path=/drill/testdata/filter/orders_parts_metadata/0_0_3.parquet], 
> ReadEntryWithPath 
> [path=/drill/testdata/filter/orders_parts_metadata/0_0_4.parquet], 
> ReadEntryWithPath 
> [path=/drill/testdata/filter/orders_parts_metadata/0_0_5.parquet], 
> ReadEntryWithPath 
> [path=/drill/testdata/filter/orders_parts_metadata/0_0_2.parquet]], 
> selectionRoot=/drill/testdata/filter/orders_parts_metadata, numFiles=5, 
> usedMetadataFile=true, 
> cacheFileRoot=/drill/testdata/filter/orders_parts_metadata, columns=[`*`]]])
> To reproduce the problem, put the attached files into a directory. Then 
> create the metadata:
> refresh table metadata dfs.`path_to_directory`;
> For example, if you put the files in 
> /drill/testdata/filter/orders_parts_metadata, then run this sql command
> refresh table metadata dfs.`/drill/testdata/filter/orders_parts_metadata`;
> Here is the same query with the same table without metadata caching.
> 0: jdbc:drill:zk=10.10.100.186:5181/drill/rho> explain plan for select \* 
> from orders_parts where int_id = -2000 or int_id = 0 or int_id = 4000 or 
> int_id is null or int_id = 10000;
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(*=[$0])
> 00-02        Project(T63¦¦*=[$0])
> 00-03          SelectionVectorRemover
> 00-04            Filter(condition=[OR(=($1, -2000), =($1, 0), =($1, 4000), IS 
> NULL($1), =($1, 10000))])
> 00-05              Project(T63¦¦*=[$0], int_id=[$1])
> 00-06                Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath 
> [path=maprfs:///drill/testdata/filter/orders_parts]], 
> selectionRoot=maprfs:/drill/testdata/filter/orders_parts, numFiles=1, 
> usedMetadataFile=false, columns=[`*`]]])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to