Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/18257 )
Change subject: IMPALA-11134: Impala returns "Couldn't skip rows in file" error for old Parquet file
......................................................................

IMPALA-11134: Impala returns "Couldn't skip rows in file" error for old Parquet file

Impala returns a "Couldn't skip rows in file" error for old Parquet files
written by an old Impala (e.g. Impala 2.5, 2.6).

In DEBUG builds Impala crashes on a DCHECK:
Check failed: num_buffered_values_ > 0 (-1 vs. 0)

The problem is that in some old Parquet files there can be a mismatch
between 'num_values' in a page and the encoded def/rep levels. There is
usually one extra def/rep level encoded in these files.

In SkipTopLevelRows() we skipped values based on how many def levels
there are:
https://github.com/apache/impala/blob/92ce6fe48e75d7780efe9a275122554e59aac916/be/src/exec/parquet/parquet-column-readers.cc#L1308-L1314

Since there are more def levels than values in some old files,
num_buffered_values_ could become negative.

This patch also takes the value of num_buffered_values_ into account when
calculating 'read_count', so we can deal with such files.

With this patch we also include the column name in the "Couldn't skip rows"
error message, so in the future it'll be easier to identify the
problematic columns.

Testing:
* added Parquet file written by Impala 2.5 and e2e test for it

Change-Id: I568fe59df720ea040be4926812412ba4c1510a26
Reviewed-on: http://gerrit.cloudera.org:8080/18257
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/parquet-column-readers.cc
M common/thrift/generate_error_codes.py
M testdata/data/README
A testdata/data/too_many_def_levels.parquet
M tests/query_test/test_scanners.py
6 files changed, 28 insertions(+), 4 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/18257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I568fe59df720ea040be4926812412ba4c1510a26
Gerrit-Change-Number: 18257
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
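
For readers who want the 'read_count' idea from the commit message in concrete form, here is a minimal sketch. It is not the code from this change: PageState, ComputeReadCount() and Consume() are hypothetical names made up for illustration, and only the general idea (bounding the skip step by the remaining buffered values as well as by the available def levels) is taken from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the column reader state; NOT Impala's actual class.
struct PageState {
  int64_t num_buffered_values;  // values left in the page according to 'num_values'
  int64_t cached_def_levels;    // decoded def levels that are not yet consumed
};

// Returns how many values one skip step may consume.
// Bounding only by the def levels can overshoot when a file encodes one
// extra level; additionally bounding by num_buffered_values prevents that.
inline int64_t ComputeReadCount(const PageState& s, int64_t rows_remaining) {
  assert(rows_remaining >= 0);
  const int64_t by_levels = std::min(s.cached_def_levels, rows_remaining);
  return std::min(by_levels, std::max<int64_t>(s.num_buffered_values, 0));
}

// Applies one skip step to the state.
inline void Consume(PageState* s, int64_t read_count) {
  s->num_buffered_values -= read_count;
  s->cached_def_levels -= read_count;
}
```

With a page that reports 3 values but carries 4 def levels, the step size in this sketch is 3, so the buffered-value counter stops at 0 instead of reaching -1.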
