Hi Gang,

For the writer, I'm seeing "parquet-mr version 1.11.1" and "parquet-mr version 1.10.1". I need to look more into the page headers to check for consistency. At the column level, in some cases the number of values read by pyarrow is consistent with num_rows and in some cases it is consistent with num_values. I don't see any discernible pattern based on schema or types.
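For anyone who wants to run the same column-level comparison on their own files, here is a minimal sketch. It only compares the footer metadata (num_rows of each row group against num_values of each column chunk) and does not decode pages, so it will not show how many values a reader actually gets back. The helper names `find_count_mismatches` and `load_counts` are my own and not part of pyarrow; `load_counts` assumes pyarrow is installed.

```python
from typing import Iterable, List, Tuple

# One row group: (num_rows, [(column_path, num_values), ...])
RowGroupCounts = Tuple[int, List[Tuple[str, int]]]


def find_count_mismatches(
    row_groups: Iterable[RowGroupCounts],
) -> List[Tuple[int, str, int, int]]:
    """Return (row_group_index, column_path, num_rows, num_values) for every
    column chunk whose num_values differs from its row group's num_rows.

    Only meaningful for non-repeated columns, where the two should be equal.
    """
    mismatches = []
    for rg_index, (num_rows, columns) in enumerate(row_groups):
        for path, num_values in columns:
            if num_values != num_rows:
                mismatches.append((rg_index, path, num_rows, num_values))
    return mismatches


def load_counts(path: str):
    """Read the writer string and per-column counts from a file's footer.

    pyarrow is imported here so the comparison above works without it.
    """
    import pyarrow.parquet as pq

    meta = pq.ParquetFile(path).metadata
    counts = []
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        cols = [
            (rg.column(j).path_in_schema, rg.column(j).num_values)
            for j in range(rg.num_columns)
        ]
        counts.append((rg.num_rows, cols))
    return meta.created_by, counts
```

Usage would look something like `created_by, counts = load_counts("example.parquet")` followed by `find_count_mismatches(counts)`, where "example.parquet" is a placeholder path. Repeated columns would need to be filtered out first, since their num_values can legitimately exceed num_rows.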
It looks like the parquet files might have been written with avro ("parquet.avro.schema" key and a corresponding schema are present in their metadata).

Thanks,
Micah

On Tue, Nov 28, 2023 at 6:30 PM Gang Wu <ust...@gmail.com> wrote:

> Hi Micah,
>
> Does the FileMetaData.version [1] provide any information about
> the writer? What about the num_values in each page header? Is
> the actual number of values consistent with num_values in the
> ColumnMetaData?
>
> [1]
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1108
>
> Best,
> Gang
>
> On Wed, Nov 29, 2023 at 2:22 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > We've recently encountered files that have inconsistencies between the
> > number of rows specified in the row group [1] and the total number of
> > values in a column [2] for non-repeated columns (within a file there is
> > inconsistency between columns but all counts appear to be greater than or
> > equal to the number of rows).
> >
> > Two questions:
> > 1. Is anyone aware of parquet implementations that might generate files
> > like this?
> > 2. Does anyone have an opinion on the correct interpretation of these
> > files? Should the files be treated as corrupt, or should the number of
> > rows be treated as authoritative and any additional data in a column be
> > truncated?
> >
> > It appears different engines make different choices in this case. Arrow
> > treats this as corruption. Spark seems to allow reading the data.
> >
> > Thanks,
> > Micah
> >
> > [1]
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L895
> > [2]
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L786