[
https://issues.apache.org/jira/browse/DRILL-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089662#comment-16089662
]
Arina Ielchiieva edited comment on DRILL-4735 at 7/17/17 1:16 PM:
------------------------------------------------------------------
Fix will cover the following aspects:
1. {{ConvertCountToDirectScan}} will be able to distinguish between implicit /
directory and non-existent columns; relates to the current Jira DRILL-4735.
To achieve this, the {{Agg_on_scan}} and {{Agg_on_proj_on_scan}} rules will take
a new constructor parameter, {{OptimizerRulesContext}}, similar to the [prune scan
rules|https://github.com/apache/drill/blob/3e8b01d5b0d3013e3811913f0fd6028b22c1ac3f/exec/java-exec/src/main/java/org/apache/drill/exec/planner/PlannerPhase.java#L345].
It will help determine whether a column is an implicit / directory column.
{{OptimizerRulesContext}} has access to session options through
{{PlannerSettings}}, which are crucial for determining the current implicit /
directory column names. After we receive the list of columns to which the rule
would be applied, we'll check whether this list contains implicit or directory
columns. If it does, we won't apply the rule.
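The guard described above can be sketched as follows. This is an illustrative standalone example, not Drill's actual API: the implicit column names ({{filename}}, {{fqn}}, {{filepath}}, {{suffix}}) and the {{dirN}} pattern are hard-coded here, whereas in Drill they would be resolved from session options via {{PlannerSettings}}.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the rule guard: bail out when any queried column
// is an implicit or directory (partition) column. Names are illustrative.
public class ImplicitColumnGuard {
    // In Drill these names come from session options via PlannerSettings;
    // hard-coded defaults are used here for illustration only.
    private static final Set<String> IMPLICIT =
            Set.of("filename", "fqn", "filepath", "suffix");

    static boolean isImplicitOrDirectory(String column) {
        String c = column.toLowerCase();
        // dir0, dir1, ... are directory (partition) columns
        return IMPLICIT.contains(c) || c.matches("dir\\d+");
    }

    /** The rule applies only if no column is implicit or a directory column. */
    static boolean ruleApplies(List<String> countColumns) {
        return countColumns.stream()
                .noneMatch(ImplicitColumnGuard::isImplicitOrDirectory);
    }

    public static void main(String[] args) {
        System.out.println(ruleApplies(List.of("n_nationkey"))); // true
        System.out.println(ruleApplies(List.of("dir0")));        // false
    }
}
```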
2. The {{ConvertCountToDirectScan}} rule will become applicable to two or more
COUNT aggregates; relates to DRILL-1691.
When column statistics are available, we use {{PojoRecordReader}} to return
their values. {{PojoRecordReader}} requires an exact model. In our case we'll
need a reader that allows a dynamic model (one where the number of columns is
not known in advance). For this purpose {{DynamicPojoRecordReader}} will be
used. Instead of an exact model it will accept a dynamic model represented by
{{List<LinkedHashMap<String, Object>> records}}, a list of maps where the key
is the column name and the value is the column value. Common logic between
{{PojoRecordReader}} and {{DynamicPojoRecordReader}} will be extracted into an
abstract parent class.
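The dynamic model above can be sketched with plain JDK types. This is an illustrative example, not Drill's actual reader code; the {{buildRecords}} helper is hypothetical and only shows the {{List<LinkedHashMap<String, Object>>}} shape the reader would consume.

```java
import java.util.LinkedHashMap;
import java.util.List;

// Illustrative sketch of the dynamic model: each record is a LinkedHashMap
// preserving column insertion order, so one reader can serve any number of
// COUNT results without a fixed POJO class. Not Drill's actual code.
public class DynamicModelDemo {
    /** Builds records in the shape a DynamicPojoRecordReader-style reader consumes. */
    static List<LinkedHashMap<String, Object>> buildRecords(long... counts) {
        LinkedHashMap<String, Object> record = new LinkedHashMap<>();
        for (int i = 0; i < counts.length; i++) {
            // key -> column name, value -> column value (a precomputed count)
            record.put("count" + i, counts[i]);
        }
        return List.of(record);
    }

    public static void main(String[] args) {
        // Mirrors the example plan output below: three COUNT aggregates
        System.out.println(buildRecords(25, 25, 25));
        // -> [{count0=25, count1=25, count2=25}]
    }
}
```

Because {{LinkedHashMap}} iterates in insertion order, the column order of the original COUNT expressions is preserved in the output row.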
3. Currently, when {{ConvertCountToDirectScan}} is applied, the plan shows the
following:
{noformat}
Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@60017daf[columns
= null, isStarQuery = false, isSkipQuery = false]])
{noformat}
The user has no idea whether column statistics were used to calculate the
result, whether partition pruning took place, etc.; relates to DRILL-5357 and
DRILL-3407.
Currently we use {{DirectGroupScan}} to hold our record reader. To include more
information we'll extend {{DirectGroupScan}} into {{MetadataDirectGroupScan}},
which will contain information about the files read, if any. Also, the
{{PojoRecordReader}} and {{DynamicPojoRecordReader}} {{toString}} methods will
be overridden to show meaningful information to the user.
Example:
{noformat}
Scan(groupscan=[usedMetadata = true, files = [/tpch/nation.parquet],
DynamicPojoRecordReader{records=[{count0=25, count1=25, count2=25}]}])
{noformat}
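A minimal sketch of the readable {{toString}} output shown in the example above. The class and field names here are illustrative stand-ins, not the actual {{MetadataDirectGroupScan}} implementation:

```java
import java.util.List;

// Hypothetical sketch: a group-scan wrapper whose toString reports whether
// metadata was used and which files were read, instead of the default
// Object hash representation (PojoRecordReader@60017daf). Names illustrative.
public class MetadataScanDemo {
    static final class MetadataDirectScan {
        final boolean usedMetadata;
        final List<String> files;
        final String readerSummary;

        MetadataDirectScan(boolean usedMetadata, List<String> files, String readerSummary) {
            this.usedMetadata = usedMetadata;
            this.files = files;
            this.readerSummary = readerSummary;
        }

        @Override
        public String toString() {
            // Matches the shape of the example plan output above
            return "usedMetadata = " + usedMetadata
                    + ", files = " + files
                    + ", " + readerSummary;
        }
    }

    public static void main(String[] args) {
        MetadataDirectScan scan = new MetadataDirectScan(
                true,
                List.of("/tpch/nation.parquet"),
                "DynamicPojoRecordReader{records=[{count0=25, count1=25, count2=25}]}");
        System.out.println(scan);
    }
}
```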
was (Author: arina):
Fix will cover the following aspects:
1. {{ConvertCountToDirectScan}} will be able to distinguish between implicit /
directory and non-existent columns; relates to the current Jira DRILL-4735.
To achieve this, the {{Agg_on_scan}} and {{Agg_on_proj_on_scan}} rules will take
a new constructor parameter, {{OptimizerRulesContext}}, similar to the [prune scan
rules|https://github.com/apache/drill/blob/3e8b01d5b0d3013e3811913f0fd6028b22c1ac3f/exec/java-exec/src/main/java/org/apache/drill/exec/planner/PlannerPhase.java#L345].
It will help determine whether a column is an implicit / directory column.
{{OptimizerRulesContext}} has access to session options through
{{PlannerSettings}}, which are crucial for determining the current implicit /
directory column names. In the case when the [GroupScan returns the column value as
0|https://github.com/apache/drill/blob/3e8b01d5b0d3013e3811913f0fd6028b22c1ac3f/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/ConvertCountToDirectScan.java#L140],
we'll check whether this column is implicit / directory. If not, we'll proceed
with applying our rule.
2. The {{ConvertCountToDirectScan}} rule will become applicable to two or more
COUNT aggregates; relates to DRILL-1691.
When column statistics are available, we use {{PojoRecordReader}} to return
their values. {{PojoRecordReader}} requires an exact model. In our case we'll
need a reader that allows a dynamic model (one where the number of columns is
not known in advance). For this purpose {{DynamicPojoRecordReader}} will be
used. Instead of an exact model it will accept a dynamic model represented by
{{List<LinkedHashMap<String, Object>> records}}, a list of maps where the key
is the column name and the value is the column value. Common logic between
{{PojoRecordReader}} and {{DynamicPojoRecordReader}} will be extracted into an
abstract parent class.
3. Currently, when {{ConvertCountToDirectScan}} is applied, the plan shows the
following:
{noformat}
Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@60017daf[columns
= null, isStarQuery = false, isSkipQuery = false]])
{noformat}
The user has no idea whether column statistics were used to calculate the
result, whether partition pruning took place, etc.; relates to DRILL-5357 and
DRILL-3407.
Currently we use {{DirectGroupScan}} to hold our record reader. To include more
information we'll extend {{DirectGroupScan}} into {{MetadataDirectGroupScan}},
which will contain information about the files read, if any. Also, the
{{PojoRecordReader}} and {{DynamicPojoRecordReader}} {{toString}} methods will
be overridden to show meaningful information to the user.
Example:
{noformat}
Scan(groupscan=[usedMetadata = true, files = [/tpch/nation.parquet],
DynamicPojoRecordReader{records=[{count0=25, count1=25, count2=25}]}])
{noformat}
> Count(dir0) on parquet returns 0 result
> ---------------------------------------
>
> Key: DRILL-4735
> URL: https://issues.apache.org/jira/browse/DRILL-4735
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization, Storage - Parquet
> Affects Versions: 1.0.0, 1.4.0, 1.6.0, 1.7.0
> Reporter: Krystal
> Assignee: Arina Ielchiieva
> Priority: Critical
>
> Selecting a count of dir0, dir1, etc against a parquet directory returns 0
> rows.
> select count(dir0) from `min_max_dir`;
> +---------+
> | EXPR$0 |
> +---------+
> | 0 |
> +---------+
> select count(dir1) from `min_max_dir`;
> +---------+
> | EXPR$0 |
> +---------+
> | 0 |
> +---------+
> If I put both dir0 and dir1 in the same select, it returns expected result:
> select count(dir0), count(dir1) from `min_max_dir`;
> +---------+---------+
> | EXPR$0 | EXPR$1 |
> +---------+---------+
> | 600 | 600 |
> +---------+---------+
> Here is the physical plan for count(dir0) query:
> {code}
> 00-00 Screen : rowType = RecordType(BIGINT EXPR$0): rowcount = 20.0,
> cumulative cost = {22.0 rows, 22.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id
> = 1346
> 00-01 Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0):
> rowcount = 20.0, cumulative cost = {20.0 rows, 20.0 cpu, 0.0 io, 0.0 network,
> 0.0 memory}, id = 1345
> 00-02 Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0):
> rowcount = 20.0, cumulative cost = {20.0 rows, 20.0 cpu, 0.0 io, 0.0 network,
> 0.0 memory}, id = 1344
> 00-03
> Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@3da85d3b[columns
> = null, isStarQuery = false, isSkipQuery = false]]) : rowType =
> RecordType(BIGINT count): rowcount = 20.0, cumulative cost = {20.0 rows, 20.0
> cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1343
> {code}
> Here is part of the explain plan for the count(dir0) and count(dir1) in the
> same select:
> {code}
> 00-00 Screen : rowType = RecordType(BIGINT EXPR$0, BIGINT EXPR$1):
> rowcount = 60.0, cumulative cost = {1206.0 rows, 15606.0 cpu, 0.0 io, 0.0
> network, 0.0 memory}, id = 1623
> 00-01 Project(EXPR$0=[$0], EXPR$1=[$1]) : rowType = RecordType(BIGINT
> EXPR$0, BIGINT EXPR$1): rowcount = 60.0, cumulative cost = {1200.0 rows,
> 15600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1622
> 00-02 StreamAgg(group=[{}], EXPR$0=[COUNT($0)], EXPR$1=[COUNT($1)]) :
> rowType = RecordType(BIGINT EXPR$0, BIGINT EXPR$1): rowcount = 60.0,
> cumulative cost = {1200.0 rows, 15600.0 cpu, 0.0 io, 0.0 network, 0.0
> memory}, id = 1621
> 00-03 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
> [path=maprfs:/drill/testdata/min_max_dir/1999/Apr/voter20.parquet/0_0_0.parquet],
> ReadEntryWithPath
> [path=maprfs:/drill/testdata/min_max_dir/1999/MAR/voter15.parquet/0_0_0.parquet],
> ReadEntryWithPath
> [path=maprfs:/drill/testdata/min_max_dir/1985/jan/voter5.parquet/0_0_0.parquet],
> ReadEntryWithPath
> [path=maprfs:/drill/testdata/min_max_dir/1985/apr/voter60.parquet/0_0_0.parquet],...,
> ReadEntryWithPath
> [path=maprfs:/drill/testdata/min_max_dir/2014/jul/voter35.parquet/0_0_0.parquet]],
> selectionRoot=maprfs:/drill/testdata/min_max_dir, numFiles=16,
> usedMetadataFile=false, columns=[`dir0`, `dir1`]]]) : rowType =
> RecordType(ANY dir0, ANY dir1): rowcount = 600.0, cumulative cost = {600.0
> rows, 1200.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1620
> {code}
> Notice that in the first case,
> "org.apache.drill.exec.store.pojo.PojoRecordReader" is used.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)