Michael Ho has posted comments on this change.

Change subject: IMPALA-3989: Display skew warning for poorly formatted Parquet files
......................................................................

Patch Set 6:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/5400/6/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

PS6, Line 318: Return
Returns true if 'row_group' overlaps with 'split_range'.


PS6, Line 333: (split_start <= row_group_start && split_end >= row_group_end);
What about the case (split_start >= row_group_start && split_end <= row_group_end)? Isn't that what happens here if a row group spans multiple blocks?


PS6, Line 463: (row_group_idx_ == -1)
nit: the parentheses aren't necessary.


PS6, Line 479: skipped all the row groups
Mind adding a brief remark that we won't be in this path if there is at least one non-empty row group which this scanner can process?


PS6, Line 499: if (CheckRowGroupOverlapsSplit(row_group, split_range)) {
             :   // If the row group overlaps the split but the mid-point does not fall within the
             :   // split, we have a poorly formatted file.
             :   misaligned_row_group_skipped = true;
misaligned_row_group_skipped |= CheckRowGroupOverlapsSplit(row_group, split_range);


http://gerrit.cloudera.org:8080/#/c/5400/6/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

PS6, Line 446: Number of scanners
Is it really the number of scanners? This counter can be bumped multiple times per scan range.


-- 
To view, visit http://gerrit.cloudera.org:8080/5400
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibf48d978383d73efdade733a892e795ebd53c76a
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <[email protected]>
Gerrit-Reviewer: Attila Jeges <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-HasComments: Yes
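
To illustrate the Line 333 question, here is a minimal sketch of an overlap check that also covers a split lying entirely inside a row group. It assumes half-open byte ranges [start, end). The name RangesOverlap and its parameters are hypothetical and are not taken from hdfs-parquet-scanner.cc.

```cpp
#include <cstdint>

// Hypothetical helper, not the Impala implementation.
// Returns true if the half-open byte ranges [split_start, split_end) and
// [row_group_start, row_group_end) share at least one byte.
bool RangesOverlap(int64_t split_start, int64_t split_end,
                   int64_t row_group_start, int64_t row_group_end) {
  // Two intervals overlap exactly when each one starts before the other ends.
  // This single condition covers partial overlap on either side, the split
  // containing the row group, and the row group containing the split.
  return split_start < row_group_end && row_group_start < split_end;
}
```

Under this assumption a row group that spans several blocks overlaps every split it touches, while ranges that merely touch at a boundary (split_end == row_group_start) do not count as overlapping.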
