Pooja Nilangekar has posted comments on this change. Change subject: Populate OffsetIndex and ColumnIndex of a row_group and Filter pages ......................................................................
Patch Set 2: (20 comments) http://gerrit.cloudera.org:8080/#/c/7465/1/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 617: } > Have you considered moving this into its own method? Done http://gerrit.cloudera.org:8080/#/c/7465/1/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: Line 1205: !current_row_group_->sorting_columns.empty(); > Couldn't it happen that columns not in the sort_columns() list are sorted? Done Line 1218: std::vector<parquet::OffsetIndex> offset_indexes(columns_.size()); > I think it may be easier to track the previous value and then decide in Pro Done PS1, Line 1217: rquet::RowGroup* row_group = &(file_metadata_.row_groups[0]); : std::vector<parquet::OffsetIndex> offset_indexes(columns_.size()); : uint8_t* buffer = nullptr; : uint32_t len = 0; : for (int i = 0; i < columns_.size(); i++) { : : parquet::OffsetIndex offset_index; : offset_index.__set_page_loca > It's unclear to me how this works. Does it go through the min_values and ma Done http://gerrit.cloudera.org:8080/#/c/7465/1/be/src/exec/parquet-index-filter.cc File be/src/exec/parquet-index-filter.cc: Line 39: if (!min_max_tuple_desc_) return Status::OK(); > Can you add a comment for this line? It's unclear to me why it's necessary. Done PS1, Line 45: ges > This could have a better name, e.g. skip_ranges or ranges_to_skip Done PS1, Line 45: o > nit: you can remove the (0) here and in all other vector ctors. It's the de Done Line 46: vector<RowRange> skip_ranges; > move this to where it's used, think of a better name Done Line 150: std::sort(skip_ranges.begin(), skip_ranges.end(), std::less<RowRange>()); > Describe what this variable stores. Maybe also explain in a high level that Done PS1, Line 152: is block computes the unio > You could also use 'for (auto& row_range : skip) ...' Done PS1, Line 158: If the skip begins before > Is this check really needed? Yes, it is necessary when the current skip range ends before the previous skip range ends. In that case, the end of the skip_range should not be updated to ensure correctness. PS1, Line 178: > nit: use pre-increment Done Line 181: > Add a comment explaining why you need to do this. The vectors should be of Done Line 196: } else > resize() should not be needed. You could also use http://en.cppreference.co Done Line 199: FilteredPageInfo page{.page_id = page_idx, > One variable per line makes it more readable. Done PS1, Line 215: } > add {}, here and below. This helps prevent mistakes when changing the code Done Line 224: page.ranges.push_back(std::make_pair(start_offset, end_offset)); > Statements that fit into a single line are ok. Done http://gerrit.cloudera.org:8080/#/c/7465/1/be/src/exec/parquet-index-filter.h File be/src/exec/parquet-index-filter.h: Line 37: /// Pair of int64_t to represent the start and end of a range of rows. Each boundary is > Mention whether each boundary is inclusive. In my experience it's most comm Done Line 45: int64_t offset; > Maybe "PageReadInfo" or something similar makes it more clear that this str Done Line 50: typedef vector<FilteredPageInfo> FilteredPages; > Here you could add a comment explaining the overall usage of this class: Wh Done -- To view, visit http://gerrit.cloudera.org:8080/7465 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Idace1e57067f95973cef3567eeb84f2ad87fd3f6 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Pooja Nilangekar <pooja.nilange...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com> Gerrit-HasComments: Yes