Zoltán Borók-Nagy created IMPALA-11134:
------------------------------------------

             Summary: Impala returns "Couldn't skip rows in file" error for old 
Parquet file
                 Key: IMPALA-11134
                 URL: https://issues.apache.org/jira/browse/IMPALA-11134
             Project: IMPALA
          Issue Type: Bug
            Reporter: Zoltán Borók-Nagy


Impala returns a "Couldn't skip rows in file" error for old Parquet files 
written by an old Impala version (e.g. Impala 2.5, 2.6).

In a DEBUG build, Impala crashes on a DCHECK:

{noformat}
F0217 18:21:34.449540 24288 parquet-column-readers.cc:1611] 
d3407555528be8a8:5ea3fceb00000001] Check failed: num_buffered_values_ > 0 (-1 
vs. 0)
{noformat}

The problem is that in some old Parquet files there can be a mismatch between 
'num_values' in a page and the encoded def/rep levels. These files usually 
encode one more def/rep level than 'num_values' indicates.

In SkipTopLevelRows() we skip values based on how many def levels are left:
https://github.com/apache/impala/blob/92ce6fe48e75d7780efe9a275122554e59aac916/be/src/exec/parquet/parquet-column-readers.cc#L1308-L1314

Since there are more def levels than values, {{num_buffered_values_}} becomes 
{{-1}}. I looked at Parquet files written by newer Impala versions, and there 
the number of def levels matches the number of values.

The workaround is fairly easy: we could also take the value of 
num_buffered_values_ into account when calculating 'read_count', i.e. 
min(min(num_buffered_values_, num_rows - i), repeated_run_length), so that we 
can deal with such files.
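
A minimal sketch of the proposed clamp, not the actual Impala code. PageState, 
ReadCountUnclamped, ReadCountClamped and BufferedAfterSkip are hypothetical 
names used only for illustration, and the unclamped variant is an assumption 
about what the current calculation looks like, inferred from the proposed 
expression above.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for the column reader state (not Impala's class).
struct PageState {
  int64_t num_buffered_values;  // values left in the page, from 'num_values'
  int64_t repeated_run_length;  // length of the current run of def levels
};

// Assumed current behaviour: only the remaining rows and the run length
// bound the number of values skipped in one step.
int64_t ReadCountUnclamped(const PageState& s, int64_t num_rows, int64_t i) {
  return std::min(num_rows - i, s.repeated_run_length);
}

// Proposed behaviour: additionally bound the step by num_buffered_values,
// so an extra encoded def level cannot push the counter below zero.
int64_t ReadCountClamped(const PageState& s, int64_t num_rows, int64_t i) {
  return std::min(std::min(s.num_buffered_values, num_rows - i),
                  s.repeated_run_length);
}

// Performs one skip step on a copy of the state and returns the resulting
// num_buffered_values.
int64_t BufferedAfterSkip(PageState s, int64_t num_rows, int64_t i,
                          bool clamp) {
  const int64_t read_count = clamp ? ReadCountClamped(s, num_rows, i)
                                   : ReadCountUnclamped(s, num_rows, i);
  s.num_buffered_values -= read_count;
  return s.num_buffered_values;
}
```

For a page with 3 buffered values but a run of 4 def levels, the unclamped 
variant leaves the counter at -1, which matches the DCHECK failure above, 
while the clamped variant stops at 0.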



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
