Attila Jeges has posted comments on this change. Change subject: IMPALA-3989: Display skew warning for poorly formatted Parquet files ......................................................................
Patch Set 6: (11 comments) http://gerrit.cloudera.org:8080/#/c/5400/6/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: PS6, Line 318: Return > Returns true if 'row_group' overlaps with 'split_range'. Done PS6, Line 333: (split_start <= row_group_start && split_end >= row_group_end) > Why is this an invalid case? See Michael's comment below. PS6, Line 333: (split_start <= row_group_start && split_end >= row_group_end); > Why is the case (split_start >= row_group_start && split_end <= row_group_e If the row group spans multiple splits, then either - L331 (split_start >= row_group_start && split_start < row_group_end) is TRUE OR - L332 (split_end > row_group_start && split_end <= row_group_end) is TRUE, so we return TRUE. PS6, Line 333: (split_start <= row_group_start && split_end >= row_group_end) > If I understand it correctly, this function's sole purpose is to return TRU Correct. I've introduced this function because the Parquet file's last split may contain only the footer (or part of the footer) and no row groups at all. This can happen even if the Parquet file is well-formatted, so we shouldn't set the 'misaligned_row_group_skipped' flag in L502. PS6, Line 463: (row_group_idx_ == -1) > nit: parenthesis isn't necessary. Done PS6, Line 479: skipped all the row groups > Mind adding a minor remark that we won't be in this path if there is at lea Done PS6, Line 499: if (CheckRowGroupOverlapsSplit(row_group, split_range)) { : // If the row group overlaps the split but the mid-point does not fall within the : // split, we have a poorly formatted file. : misaligned_row_group_skipped = true; > misaligned_row_group_skipped |= CheckRowGroupOverlapSplit(row_group, split_ Done PS6, Line 546: > extra whitespace Done http://gerrit.cloudera.org:8080/#/c/5400/6/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: PS6, Line 446: Number of scanners > Is it really number of scanners ? This counter can be bumped multiple times I could be mistaken but I believe it is the number of scanners. The scan node creates a separate scanner object for each split and calls the scanner's ProcessSplit() function to process it. ProcessSplit() calls NextRowGroup() function to iterate over the row groups that belong to the split. The counter is bumped only when NextRowGroup() is called the first time on a split but it cannot find a row group to return (because all the row groups in the split were misaligned). Since each split is scanned by a different scanner, the counter will be bumped only once at most per scanner. http://gerrit.cloudera.org:8080/#/c/5400/6/common/thrift/generate_error_codes.py File common/thrift/generate_error_codes.py: PS6, Line 315: Row group size doesn't match HDFS block " : "size. > This would be clearer if you said: Done PS6, Line 319: Parquet file '$0' is poorly formatted. Please change fs.s3a.block.size to match the " : "row group size of your file > Here, it is a stretch to say that the Parquet file is poorly formatted. The Done -- To view, visit http://gerrit.cloudera.org:8080/5400 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibf48d978383d73efdade733a892e795ebd53c76a Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila Jeges <[email protected]> Gerrit-Reviewer: Attila Jeges <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Sailesh Mukil <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-HasComments: Yes
