Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics 
to skip pages based on equi-join predicate
......................................................................


Patch Set 22:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16720/12//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16720/12//COMMIT_MSG@9
PS12, Line 9: This patch adds the logic to utilize min/max stats
> Okay.
Here is another good use case

select a.l_orderkey from               
lineitem a join [SHUFFLE] orders b                                              
              
where a.l_orderkey = b.o_orderkey                                   
and b.o_custkey = 5 order by l_orderkey ;

where 92 pages were filtered out in one impalad.

The reason for the good filtering efficiency is that the low values in pages on 
l_orderkey are somewhat monotonic.

In ::EvalOverlap(), data: min=1, max=4275687, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=1, max=3228897, filter: min=224167, max=2630562
In ::EvalOverlap(), data: min=17959, max=3246245, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=35235, max=3263680, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=52737, max=3280196, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=68708, max=3290466, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=79142, max=3294566, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=83169, max=3298692, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=87237, max=3302721, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=91328, max=3306725, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=95329, max=3310720, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=99334, max=3314819, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=103428, max=3318818, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=107458, max=3322980, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=111462, max=3327015, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=115591, max=3331171, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=119712, max=3335331, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=123810, max=3339429, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=127971, max=3343553, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=131973, max=3347585, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=136102, max=3351588, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=140354, max=3355745, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=144422, max=3359684, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=148544, max=3363814, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=152546, max=3367907, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=156640, max=3372006, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=160582, max=3376130, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=164677, max=3380229, filter: min=224167, 
max=2630562
In ::EvalOverlap(), data: min=168770, max=3384258, filter: min=224167, 
max=2630562


It seems we can help create such an arrangement for a Parquet data file by a 
"partial sort" hint on a column, such as follows. The partial sort order can be 
done at the row group level.

CREATE EXTERNAL TABLE functional.alltypesagg (                                  
                                                                                
                                  id INT,                                       
                                                                                
                                                                  bool_col 
BOOLEAN,                                                                        
                                                                                
                       tinyint_col TINYINT,                                     
                                                                                
                                                       smallint_col SMALLINT,   
                                                                                
                                                                                
       int_col INT,                                                             
                                                                                
                                       bigint_col BIGINT,                       
                                                                                
                                                                       
float_col FLOAT,                                                                
                                                                                
                                double_col DOUBLE,                              
                                                                                
                                                                date_string_col 
STRING,                                                                         
                                                                                
                string_col STRING,                                              
                                                                                
                                                timestamp_col TIMESTAMP         
                                                                                
                                                                                
)                                                                               
                                                                                
                                  PARTITIONED BY (                              
                                                                                
                                                                    year INT,   
                                                                                
                                                                                
                    month INT,                                                  
                                                                                
                                                    day INT                     
                                                                                
                                                                                
    )
PARTIALLY SORTED BY (string_col)



--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 22
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Tue, 01 Dec 2020 20:13:21 +0000
Gerrit-HasComments: Yes

Reply via email to