Attila Jeges has posted comments on this change.

Change subject: IMPALA-3989: Display skew warning for poorly formatted Parquet 
files
......................................................................


Patch Set 6:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/5400/6/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

PS6, Line 318: Return 
> Returns true if 'row_group' overlaps with 'split_range'.
Done


PS6, Line 333: (split_start <= row_group_start && split_end >= row_group_end)
> Why is this an invalid case?
See Michael's comment below.


PS6, Line 333: (split_start <= row_group_start && split_end >= row_group_end);
> Why is the case (split_start >= row_group_start && split_end <= row_group_e
If the row group spans multiple splits, then either
- L331 (split_start >= row_group_start && split_start < row_group_end) is TRUE

OR 

- L332 (split_end > row_group_start && split_end <= row_group_end) is TRUE,

so we return TRUE.


PS6, Line 333: (split_start <= row_group_start && split_end >= row_group_end)
> If I understand it correctly, this function's sole purpose is to return TRU
Correct.

I've introduced this function because the Parquet file's last split may contain 
only the footer (or part of the footer) and no row groups at all. This can 
happen even if the Parquet file is well-formatted, so we shouldn't set the 
'misaligned_row_group_skipped' flag in L502.


PS6, Line 463: (row_group_idx_ == -1)
> nit: parenthesis isn't necessary.
Done


PS6, Line 479: skipped all the row groups
> Mind adding a minor remark that we won't be in this path if there is at lea
Done


PS6, Line 499:       if (CheckRowGroupOverlapsSplit(row_group, split_range)) {
             :         // If the row group overlaps the split but the mid-point 
does not fall within the
             :         // split, we have a poorly formatted file.
             :         misaligned_row_group_skipped = true;
> misaligned_row_group_skipped |= CheckRowGroupOverlapSplit(row_group, split_
Done


PS6, Line 546:  
> extra whitespace
Done


http://gerrit.cloudera.org:8080/#/c/5400/6/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

PS6, Line 446:  Number of scanners
> Is it really number of scanners ? This counter can be bumped multiple times
I could be mistaken but I believe it is the number of scanners.

The scan node creates a separate scanner object for each split and calls the 
scanner's ProcessSplit() function to process it. ProcessSplit() calls 
NextRowGroup() function to iterate over the row groups that belong to the split.

The counter is bumped only when NextRowGroup() is called the first time on a 
split but it cannot find a row group to return (because all the row groups in 
the split were misaligned). Since each split is scanned by a different scanner, 
the counter will be bumped only once at most per scanner.


http://gerrit.cloudera.org:8080/#/c/5400/6/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

PS6, Line 315: Row group size doesn't match HDFS block "
             :    "size.
> This would be clearer if you said:
Done


PS6, Line 319: Parquet file '$0' is poorly formatted. Please change 
fs.s3a.block.size to match the "
             :    "row group size of your file
> Here, it is a stretch to say that the Parquet file is poorly formatted. The
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5400
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibf48d978383d73efdade733a892e795ebd53c76a
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <[email protected]>
Gerrit-Reviewer: Attila Jeges <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-HasComments: Yes

Reply via email to