Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/21559 )
Change subject: IMPALA-13161: Fix column index overflow in DelimitedTextParser ...................................................................... Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/21559/2/be/src/exec/delimited-text-parser.inline.h File be/src/exec/delimited-text-parser.inline.h: http://gerrit.cloudera.org:8080/#/c/21559/2/be/src/exec/delimited-text-parser.inline.h@76 PS2, Line 76: if (column_idx_ < num_cols_) ++column_idx_; > That file would be ambiguous if it had more than 1 column? How does the beh The file won't be ambiguous. How to read and materialize the data is defined by the table schema (i.e. column types) and the tuple descriptor (i.e. what the query wants, mainly a bool array of 'is_materialized_col_'). How many columns in a text line is determined by the field delimiters. If one line has N field delimiters, then N+1 columns can be read out from that line. However, it depends on the table schema and tuple descriptor to select the columns. As this comment mentions, we use a two pass parsing approach: https://github.com/apache/impala/blob/e2e45401e2bead4090fd5c562709db521cbc6d38/be/src/exec/hdfs-scanner.h#L60-L77 The first pass fills out the FieldLocation structs which contain where all the fields start and their lengths in the byte buffer. The second pass uses the FieldLocation to materialize slots in the tuples. AddColumn() is used in the first pass. If a text line has more columns than the table schema, they won't be added into the FieldLocations. This is judged by ReturnCurrentColumn(): bool ReturnCurrentColumn() const { return column_idx_ < num_cols_ && is_materialized_col_[column_idx_]; } 'column_idx_' is the index of the column in the text line. It starts from the number of partitions (not starts from 0). It's only used to compare with num_cols_ and num_partition_keys_ so we don't need to bump it if it's already larger than num_cols_. The original code keeps bumping it even if it's overflowed. Then column_idx_ < num_cols_ passes and accessing is_materialized_col_[column_idx_] crashes. With this fix, column_idx_ won't be larger than num_cols_ so won't cause the crash. This is the only behavior change and it's not aware by the user or any caller of this function. -- To view, visit http://gerrit.cloudera.org:8080/21559 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I527a8971e92e270d5576c2155e4622dd6d43d745 Gerrit-Change-Number: 21559 Gerrit-PatchSet: 2 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Fri, 12 Jul 2024 06:27:42 +0000 Gerrit-HasComments: Yes
