Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3909: [DOCS] Add general info about Parquet min/max optimization
......................................................................

Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7068/1/docs/topics/impala_parquet.xml
File docs/topics/impala_parquet.xml:

PS1, Line 363: data block
Not sure what "data block" means. "each row group and data page" would be more precise.

I feel like the current text may confuse readers about what is in Parquet files in general versus how Impala writes out files versus what Impala actually makes use of on the read path right now.

Currently both Impala and other tools write out stats at both the row group and data page level. The data pages are a smaller granularity. Row groups are much larger granularity. I think the salient fact there is that there are typically a small number of row groups per file (1 for Impala).

Impala currently only uses the row group-level statistics to skip over large parts of the file at a time, but we have plans to use the page-level statistics.


PS1, Line 366: whether the file
"parts of each file", because it could be a data page or row group.


--
To view, visit http://gerrit.cloudera.org:8080/7068
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5fd5f7b157024f6089af7feffcb538c160bb130d
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes
