[
https://issues.apache.org/jira/browse/DRILL-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089662#comment-16089662
]
Arina Ielchiieva edited comment on DRILL-4735 at 7/17/17 1:16 PM:
------------------------------------------------------------------
Fix will cover the following aspects:
1. {{ConvertCountToDirectScan}} will be able to distinguish between implicit /
directory and non-existent columns; relates to the current Jira DRILL-4735.
To achieve this, the {{Agg_on_scan}} and {{Agg_on_proj_on_scan}} rules will take
a new constructor parameter, {{OptimizerRulesContext}}, similar to the [prune scan
rules|https://github.com/apache/drill/blob/3e8b01d5b0d3013e3811913f0fd6028b22c1ac3f/exec/java-exec/src/main/java/org/apache/drill/exec/planner/PlannerPhase.java#L345].
It will help determine whether a column is an implicit / directory column.
{{OptimizerRulesContext}} has access to session options through
{{PlannerSettings}}, which are crucial for determining the current implicit /
directory column names. After we receive the list of columns to which the rule
would be applied, we'll check whether this list contains implicit or directory
columns. If it does, we won't apply the rule.
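The guard described above can be sketched as follows. This is an illustrative standalone example, not Drill's actual API: the implicit column names ({{filename}}, {{fqn}}, {{filepath}}, {{suffix}}) and the {{dirN}} pattern are hard-coded here, whereas in Drill they would be resolved from session options via {{PlannerSettings}}.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the rule guard: bail out when any queried column
// is an implicit or directory (partition) column. Names are illustrative.
public class ImplicitColumnGuard {
    // In Drill these names come from session options via PlannerSettings;
    // hard-coded defaults are used here for illustration only.
    private static final Set<String> IMPLICIT =
            Set.of("filename", "fqn", "filepath", "suffix");

    static boolean isImplicitOrDirectory(String column) {
        String c = column.toLowerCase();
        // dir0, dir1, ... are directory (partition) columns
        return IMPLICIT.contains(c) || c.matches("dir\\d+");
    }

    /** The rule applies only if no column is implicit or a directory column. */
    static boolean ruleApplies(List<String> countColumns) {
        return countColumns.stream()
                .noneMatch(ImplicitColumnGuard::isImplicitOrDirectory);
    }

    public static void main(String[] args) {
        System.out.println(ruleApplies(List.of("n_nationkey"))); // true
        System.out.println(ruleApplies(List.of("dir0")));        // false
    }
}
```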
2. The {{ConvertCountToDirectScan}} rule will become applicable to two or more
COUNT aggregates; relates to DRILL-1691.
When column statistics are available, we use {{PojoRecordReader}} to return
their values. {{PojoRecordReader}} requires an exact model. In our case we'll
need a reader that allows a dynamic model (one where the number of columns is
not known in advance). For this purpose {{DynamicPojoRecordReader}} will be
used. Instead of an exact model it will accept a dynamic model represented by
{{List<LinkedHashMap<String, Object>> records}}, a list of maps where the key
is the column name and the value is the column value. Common logic between
{{PojoRecordReader}} and {{DynamicPojoRecordReader}} will be extracted into an
abstract parent class.
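The dynamic model above can be sketched with plain JDK types. This is an illustrative example, not Drill's actual reader code; the {{buildRecords}} helper is hypothetical and only shows the {{List<LinkedHashMap<String, Object>>}} shape the reader would consume.

```java
import java.util.LinkedHashMap;
import java.util.List;

// Illustrative sketch of the dynamic model: each record is a LinkedHashMap
// preserving column insertion order, so one reader can serve any number of
// COUNT results without a fixed POJO class. Not Drill's actual code.
public class DynamicModelDemo {
    /** Builds records in the shape a DynamicPojoRecordReader-style reader consumes. */
    static List<LinkedHashMap<String, Object>> buildRecords(long... counts) {
        LinkedHashMap<String, Object> record = new LinkedHashMap<>();
        for (int i = 0; i < counts.length; i++) {
            // key -> column name, value -> column value (a precomputed count)
            record.put("count" + i, counts[i]);
        }
        return List.of(record);
    }

    public static void main(String[] args) {
        // Mirrors the example plan output below: three COUNT aggregates
        System.out.println(buildRecords(25, 25, 25));
        // -> [{count0=25, count1=25, count2=25}]
    }
}
```

Because {{LinkedHashMap}} iterates in insertion order, the column order of the original COUNT expressions is preserved in the output row.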
3. Currently, when {{ConvertCountToDirectScan}} is applied, the plan shows the
following:
{noformat}
Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@60017daf[columns
= null, isStarQuery = false, isSkipQuery = false]])
{noformat}
The user has no idea whether column statistics were used to calculate the
result, whether partition pruning took place, etc.; relates to DRILL-5357 and
DRILL-3407.
Currently we use {{DirectGroupScan}} to hold our record reader. To include more
information we'll extend {{DirectGroupScan}} into {{MetadataDirectGroupScan}},
which will contain information about the files read, if any. Also, the
{{PojoRecordReader}} and {{DynamicPojoRecordReader}} {{toString}} methods will
be overridden to show meaningful information to the user.
Example:
{noformat}
Scan(groupscan=[usedMetadata = true, files = [/tpch/nation.parquet],
DynamicPojoRecordReader{records=[{count0=25, count1=25, count2=25}]}])
{noformat}
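A minimal sketch of the readable {{toString}} output shown in the example above. The class and field names here are illustrative stand-ins, not the actual {{MetadataDirectGroupScan}} implementation:

```java
import java.util.List;

// Hypothetical sketch: a group-scan wrapper whose toString reports whether
// metadata was used and which files were read, instead of the default
// Object hash representation (PojoRecordReader@60017daf). Names illustrative.
public class MetadataScanDemo {
    static final class MetadataDirectScan {
        final boolean usedMetadata;
        final List<String> files;
        final String readerSummary;

        MetadataDirectScan(boolean usedMetadata, List<String> files, String readerSummary) {
            this.usedMetadata = usedMetadata;
            this.files = files;
            this.readerSummary = readerSummary;
        }

        @Override
        public String toString() {
            // Matches the shape of the example plan output above
            return "usedMetadata = " + usedMetadata
                    + ", files = " + files
                    + ", " + readerSummary;
        }
    }

    public static void main(String[] args) {
        MetadataDirectScan scan = new MetadataDirectScan(
                true,
                List.of("/tpch/nation.parquet"),
                "DynamicPojoRecordReader{records=[{count0=25, count1=25, count2=25}]}");
        System.out.println(scan);
    }
}
```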
was (Author: arina):
Fix will cover the following aspects:
1. {{ConvertCountToDirectScan}} will be able to distinguish between implicit /
directory and non-existent columns; relates to the current Jira DRILL-4735.
To achieve this, the {{Agg_on_scan}} and {{Agg_on_proj_on_scan}} rules will take
a new constructor parameter, {{OptimizerRulesContext}}, similar to the [prune scan
rules|https://github.com/apache/drill/blob/3e8b01d5b0d3013e3811913f0fd6028b22c1ac3f/exec/java-exec/src/main/java/org/apache/drill/exec/planner/PlannerPhase.java#L345].
It will help determine whether a column is an implicit / directory column.
{{OptimizerRulesContext}} has access to session options through
{{PlannerSettings}}, which are crucial for determining the current implicit /
directory column names. In the case when the [GroupScan returns the column value as
0|https://github.com/apache/drill/blob/3e8b01d5b0d3013e3811913f0fd6028b22c1ac3f/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/ConvertCountToDirectScan.java#L140],
we'll check whether this column is implicit / directory. If not, we'll proceed
with applying our rule.
2. The {{ConvertCountToDirectScan}} rule will become applicable to two or more
COUNT aggregates; relates to DRILL-1691.
When column statistics are available, we use {{PojoRecordReader}} to return
their values. {{PojoRecordReader}} requires an exact model. In our case we'll
need a reader that allows a dynamic model (one where the number of columns is
not known in advance). For this purpose {{DynamicPojoRecordReader}} will be
used. Instead of an exact model it will accept a dynamic model represented by
{{List<LinkedHashMap<String, Object>> records}}, a list of maps where the key
is the column name and the value is the column value. Common logic between
{{PojoRecordReader}} and {{DynamicPojoRecordReader}} will be extracted into an
abstract parent class.
3. Currently, when {{ConvertCountToDirectScan}} is applied, the plan shows the
following:
{noformat}
Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@60017daf[columns
= null, isStarQuery = false, isSkipQuery = false]])
{noformat}
The user has no idea whether column statistics were used to calculate the
result, whether partition pruning took place, etc.; relates to DRILL-5357 and
DRILL-3407.
Currently we use {{DirectGroupScan}} to hold our record reader. To include more
information we'll extend {{DirectGroupScan}} into {{MetadataDirectGroupScan}},
which will contain information about the files read, if any. Also, the
{{PojoRecordReader}} and {{DynamicPojoRecordReader}} {{toString}} methods will
be overridden to show meaningful information to the user.
Example:
{noformat}
Scan(groupscan=[usedMetadata = true, files = [/tpch/nation.parquet],
DynamicPojoRecordReader{records=[{count0=25, count1=25, count2=25}]}])
{noformat}
> Count(dir0) on parquet returns 0 result
> ---------------------------------------
>
> Key: DRILL-4735
> URL: https://issues.apache.org/jira/browse/DRILL-4735
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization, Storage - Parquet
> Affects Versions: 1.0.0, 1.4.0, 1.6.0, 1.7.0
> Reporter: Krystal
> Assignee: Arina Ielchiieva
> Priority: Critical
>
> Selecting a count of dir0, dir1, etc against a parquet directory returns 0
> rows.
> select count(dir0) from `min_max_dir`;
> +---------+
> | EXPR$0 |
> +---------+
> | 0 |
> +---------+
> select count(dir1) from `min_max_dir`;
> +---------+
> | EXPR$0 |
> +---------+
> | 0 |
> +---------+
> If I put both dir0 and dir1 in the same select, it returns expected result:
> select count(dir0), count(dir1) from `min_max_dir`;
> +---------+---------+
> | EXPR$0 | EXPR$1 |
> +---------+---------+
> | 600 | 600 |
> +---------+---------+
> Here is the physical plan for count(dir0) query:
> {code}
> 00-00 Screen : rowType = RecordType(BIGINT EXPR$0): rowcount = 20.0,
> cumulative cost = {22.0 rows, 22.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id
> = 1346
> 00-01 Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0):
> rowcount = 20.0, cumulative cost = {20.0 rows, 20.0 cpu, 0.0 io, 0.0 network,
> 0.0 memory}, id = 1345
> 00-02 Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0):
> rowcount = 20.0, cumulative cost = {20.0 rows, 20.0 cpu, 0.0 io, 0.0 network,
> 0.0 memory}, id = 1344
> 00-03
> Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@3da85d3b[columns
> = null, isStarQuery = false, isSkipQuery = false]]) : rowType =
> RecordType(BIGINT count): rowcount = 20.0, cumulative cost = {20.0 rows, 20.0
> cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1343
> {code}
> Here is part of the explain plan for the count(dir0) and count(dir1) in the
> same select:
> {code}
> 00-00 Screen : rowType = RecordType(BIGINT EXPR$0, BIGINT EXPR$1):
> rowcount = 60.0, cumulative cost = {1206.0 rows, 15606.0 cpu, 0.0 io, 0.0
> network, 0.0 memory}, id = 1623
> 00-01 Project(EXPR$0=[$0], EXPR$1=[$1]) : rowType = RecordType(BIGINT
> EXPR$0, BIGINT EXPR$1): rowcount = 60.0, cumulative cost = {1200.0 rows,
> 15600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1622
> 00-02 StreamAgg(group=[{}], EXPR$0=[COUNT($0)], EXPR$1=[COUNT($1)]) :
> rowType = RecordType(BIGINT EXPR$0, BIGINT EXPR$1): rowcount = 60.0,
> cumulative cost = {1200.0 rows, 15600.0 cpu, 0.0 io, 0.0 network, 0.0
> memory}, id = 1621
> 00-03 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
> [path=maprfs:/drill/testdata/min_max_dir/1999/Apr/voter20.parquet/0_0_0.parquet],
> ReadEntryWithPath
> [path=maprfs:/drill/testdata/min_max_dir/1999/MAR/voter15.parquet/0_0_0.parquet],
> ReadEntryWithPath
> [path=maprfs:/drill/testdata/min_max_dir/1985/jan/voter5.parquet/0_0_0.parquet],
> ReadEntryWithPath
> [path=maprfs:/drill/testdata/min_max_dir/1985/apr/voter60.parquet/0_0_0.parquet],...,
> ReadEntryWithPath
> [path=maprfs:/drill/testdata/min_max_dir/2014/jul/voter35.parquet/0_0_0.parquet]],
> selectionRoot=maprfs:/drill/testdata/min_max_dir, numFiles=16,
> usedMetadataFile=false, columns=[`dir0`, `dir1`]]]) : rowType =
> RecordType(ANY dir0, ANY dir1): rowcount = 600.0, cumulative cost = {600.0
> rows, 1200.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1620
> {code}
> Notice that in the first case,
> "org.apache.drill.exec.store.pojo.PojoRecordReader" is used.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)