Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16723 )

Change subject: IMPALA-10314: Optimize planning time for simple limits
......................................................................


Patch Set 5:

(2 comments)

Looks very good!

The empty parquet data file may be a corner case to worry about.

http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@870
PS5, Line 870: fd.getFileLength()
Is it possible for a Parquet data file with empty # of rows pass this test? 
Note that due to the meta-data portion, such a file will have some number of 
bytes.  See one case here https://github.com/G-Research/ParquetSharp/issues/110.

If we can look at the meta-data of such a file, the number of rows is right 
there.

871 struct FileMetaData {                 
872   /** Version of this file **/
873   1: required i32 version                                              
874                                                   
875   /** Parquet schema for this file.  This schema contains metadata for all 
the columns.
876    * The schema is represented as a tree with a single root.  The nodes of 
the tree
877    * are flattened to a list by doing a depth-first traversal.
878    * The column metadata contains the path in the schema for that column 
which can be
879    * used to map columns to nodes in the schema.                            
         
880    * The first element is the root **/                          
881   2: required list<SchemaElement> schema;  
882                       
883   /** Number of rows in this file **/                                       
884   3: required i64 num_rows


http://gerrit.cloudera.org:8080/#/c/16723/5/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@947
PS5, Line 947: if (isSimpleLimit && simpleLimitNumRows ==
             :           analyzer.getSimpleLimitStatus().second) {
             :         // for the simple limit case if the estimated rows has 
already reached the limit
             :         // there's no need to process more partitions
             :         break;
             :       }
This is good.



--
To view, visit http://gerrit.cloudera.org:8080/16723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d6a79263bc092e0f3e9a1d72da5618f3cc35574
Gerrit-Change-Number: 16723
Gerrit-PatchSet: 5
Gerrit-Owner: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Shant Hovsepian <sh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Thu, 19 Nov 2020 14:30:23 +0000
Gerrit-HasComments: Yes

Reply via email to