Hi Gang,
For the writer, I'm seeing "parquet-mr version 1.11.1" and "parquet-mr version
1.10.1".  I need to look more into the page headers to check for
consistency.  At the column level, in some cases the number of values read
by pyarrow is consistent with num_rows and in some cases it is consistent
with num_values. I don't see any discernible pattern based on schema or
types.
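
For anyone who wants to look for the same mismatch, here is a rough sketch of the comparison, not the exact script I ran. It assumes pyarrow is installed; the function names are made up for illustration. It only inspects footer metadata (it does not decode pages), and it skips repeated columns because those can legitimately hold more values than rows.

```python
def find_count_mismatches(metadata):
    """Compare each column chunk's num_values with its row group's num_rows.

    `metadata` is expected to look like a pyarrow.parquet.FileMetaData.
    Returns a list of (row_group_index, column_path, num_rows, num_values)
    for non-repeated columns where the two counts differ.
    """
    mismatches = []
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            # Repeated columns may legitimately have more values than rows.
            if metadata.schema.column(col_index).max_repetition_level > 0:
                continue
            chunk = row_group.column(col_index)
            if chunk.num_values != row_group.num_rows:
                mismatches.append(
                    (rg_index, chunk.path_in_schema,
                     row_group.num_rows, chunk.num_values)
                )
    return mismatches


def check_file(path):
    """Hypothetical helper: read only the footer of `path` and check it."""
    import pyarrow.parquet as pq  # imported lazily; only needed for real files
    return find_count_mismatches(pq.read_metadata(path))
```

An empty list from check_file("some_file.parquet") (a placeholder path) means the footer counts agree for every non-repeated column. It says nothing about whether the page headers agree with the footer.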

It looks like the parquet files might have been written with
avro ("parquet.avro.schema" key and a corresponding schema are present in
their metadata).
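
The writer string and the avro key can both be read from the footer. A small sketch, again assuming pyarrow and using a made-up function name; pyarrow exposes the key-value metadata with bytes keys, and the presence of the key only hints at the writer rather than proving it.

```python
def describe_writer(metadata):
    """Summarize writer hints from a pyarrow.parquet.FileMetaData-like object."""
    # The key-value metadata may be None when the file carries none.
    key_values = metadata.metadata or {}
    return {
        # e.g. a string starting with "parquet-mr version 1.11.1"
        "created_by": metadata.created_by,
        "format_version": metadata.format_version,
        # Keys are bytes in pyarrow's view of the footer metadata.
        "has_avro_schema": b"parquet.avro.schema" in key_values,
    }
```

With a real file this would be called as describe_writer(pyarrow.parquet.read_metadata(path)).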

Thanks,
Micah

On Tue, Nov 28, 2023 at 6:30 PM Gang Wu <ust...@gmail.com> wrote:

> Hi Micah,
>
> Does the FileMetaData.version [1] provide any information about
> the writer? What about the num_values in each page header? Is
> the actual number of values consistent with num_values in the
> ColumnMetaData?
>
> [1]
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1108
>
> Best,
> Gang
>
> On Wed, Nov 29, 2023 at 2:22 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > We've recently encountered files that have inconsistencies between the
> > number of rows specified in the row group [1] and the total number of
> > values in a column [2] for non-repeated columns (within a file there is
> > inconsistency between columns but all counts appear to be greater than or
> > equal to the number of rows).
> >
> > Two questions:
> > 1.  Is anyone aware of parquet implementations that might generate files
> > like this?
> > 2.  Does anyone have an opinion on the correct interpretation of these
> > files?  Should the files be treated as corrupt, or should the number of
> > rows be treated as authoritative and any additional data in a column be
> > truncated?
> >
> > It appears different engines make different choices in this case.  Arrow
> > treats this as corruption. Spark seems to allow reading the data.
> >
> > Thanks,
> > Micah
> >
> >
> > [1]
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L895
> > [2]
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L786
> >
>
