We've recently encountered files where the number of rows specified in
the row group [1] is inconsistent with the total number of values in a
column [2] for non-repeated columns (within a single file the counts
vary between columns, but all appear to be greater than or equal to the
number of rows).

Two questions:
1.  Is anyone aware of parquet implementations that might generate files
like this?
2.  Does anyone have an opinion on the correct interpretation of these
files?  Should the files be treated as corrupt, or should the number of
rows be treated as authoritative and any additional data in a column be
truncated?

It appears different engines make different choices here: Arrow treats
this as corruption, while Spark seems to allow reading the data.

Thanks,
Micah


[1]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L895
[2]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L786