Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 )
Change subject: [WIP] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate ...................................................................... Patch Set 22: (1 comment) http://gerrit.cloudera.org:8080/#/c/16720/12//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16720/12//COMMIT_MSG@9 PS12, Line 9: This patch adds the logic to utilize min/max stats > Okay. Here is another good use case select a.l_orderkey from lineitem a join [SHUFFLE] orders b where a.l_orderkey = b.o_orderkey and b.o_custkey = 5 order by l_orderkey ; where 92 pages were filtered out in one impalad. The reason for the good filtering efficiency is that the low values in pages on l_orderkey are somewhat monotonic. In ::EvalOverlap(), data: min=1, max=4275687, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=1, max=3228897, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=17959, max=3246245, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=35235, max=3263680, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=52737, max=3280196, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=68708, max=3290466, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=79142, max=3294566, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=83169, max=3298692, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=87237, max=3302721, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=91328, max=3306725, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=95329, max=3310720, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=99334, max=3314819, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=103428, max=3318818, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=107458, max=3322980, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=111462, max=3327015, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=115591, max=3331171, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=119712, max=3335331, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=123810, max=3339429, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=127971, max=3343553, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=131973, max=3347585, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=136102, max=3351588, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=140354, max=3355745, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=144422, max=3359684, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=148544, max=3363814, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=152546, max=3367907, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=156640, max=3372006, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=160582, max=3376130, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=164677, max=3380229, filter: min=224167, max=2630562 In ::EvalOverlap(), data: min=168770, max=3384258, filter: min=224167, max=2630562 It seems we can help create such an arrangement for a Parquet data file by a "partial sort" hint on a column, such as follows. The partial sort order can be done at the row group level. CREATE EXTERNAL TABLE functional.alltypesagg ( id INT, bool_col BOOLEAN, tinyint_col TINYINT, smallint_col SMALLINT, int_col INT, bigint_col BIGINT, float_col FLOAT, double_col DOUBLE, date_string_col STRING, string_col STRING, timestamp_col TIMESTAMP ) PARTITIONED BY ( year INT, month INT, day INT ) PARTIALLY SORTED BY (string_col) -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 22 Gerrit-Owner: Qifan Chen <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Tue, 01 Dec 2020 20:13:21 +0000 Gerrit-HasComments: Yes
