I'd be in favor of adding a flag, false by default, that allows this type
of schema. Another option: if we can identify the writer of these files, we
could make an exception specifically for those versions.
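
To make the two options concrete, here is a rough sketch in Python. It is
not the actual Parquet C++ or parquet-mr code: every name in it
(Annotation, resolve_annotation, allow_mismatched, buggy_writers, the
COMPATIBLE table) is made up for illustration, and the compatibility table
is a toy with only two entries. The idea is that a mismatched logical type
is rejected by default, and only when the flag is set (or the writer is on
a known-bad list) do we fall back to the converted type, as parquet-mr
reportedly does.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-in for the two annotations stored on a schema node."""
    converted_type: Optional[str]  # legacy annotation, e.g. "MAP_KEY_VALUE"
    logical_type: Optional[str]    # newer annotation, e.g. "UNKNOWN"


# Toy compatibility table (an assumption, far from complete).
COMPATIBLE = frozenset({
    ("MAP", "MAP"),
    ("TIMESTAMP_MILLIS", "TIMESTAMP"),
})


def resolve_annotation(
    ann: Annotation,
    allow_mismatched: bool = False,            # the proposed flag, off by default
    created_by: str = "",                      # writer string from the file footer
    buggy_writers: FrozenSet[str] = frozenset(),  # caller-supplied exception list
) -> Tuple[str, Optional[str]]:
    """Return which annotation to trust, as (source, type name)."""
    if ann.logical_type is None:
        # No logical type written: only the converted type is available.
        return ("converted", ann.converted_type)
    if ann.converted_type is None or (ann.converted_type, ann.logical_type) in COMPATIBLE:
        # Nothing to contradict the logical type.
        return ("logical", ann.logical_type)
    if allow_mismatched or created_by in buggy_writers:
        # Lenient mode: ignore the contradictory logical type.
        return ("converted", ann.converted_type)
    # Strict mode (default): reject the schema.
    raise ValueError(
        f"logical type {ann.logical_type} is incompatible with "
        f"converted type {ann.converted_type}"
    )
```

With this sketch, resolve_annotation(Annotation("MAP_KEY_VALUE", "UNKNOWN"))
raises by default, while passing allow_mismatched=True makes it fall back to
the MAP_KEY_VALUE converted type. Keeping the default strict means existing
users see no behavior change unless they opt in.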

On Wed, May 18, 2022 at 4:02 PM William Butler <[email protected]>
wrote:

> >
> > Well, why is UNKNOWN used here? This seems like a bug in the producer:
>
> It is a bug in the producer, see the JIRA, but now there are a lot of these
> files out there in the wild. Adding an explicit workaround might be
> reasonable given the pervasiveness.
>
> On Mon, May 16, 2022 at 4:57 AM Antoine Pitrou <[email protected]> wrote:
>
> > On Thu, 12 May 2022 09:46:57 -0700
> > William Butler <[email protected]>
> > wrote:
> > >
> > > From the JIRA, the converted type looks something like
> > >
> > >   required group FeatureAmounts (MAP) {
> > >     repeated group map (MAP_KEY_VALUE) {
> > >       required binary key (STRING);
> > >       required binary key (STRING);
> > >     }
> > >   }
> > >
> > >
> > > but the logical type looks like
> > >
> > >   required group FeatureAmounts (MAP) {
> > >     repeated group map (UNKNOWN) {
> > >       required binary key (STRING);
> > >       required binary key (STRING);
> > >     }
> > >   }
> > >
> > > > Parquet C++ does not like that the UNKNOWN/NullLogicalType is being
> > > > used in the groups and rejects the schema with an exception.
> >
> > Well, why is UNKNOWN used here? This seems like a bug in the producer:
> > if MAP_KEY_VALUE does not have an equivalent logical type, then no
> > logical type annotation should be produced, instead of the "UNKNOWN"
> > logical type annotation which means that all values are null and the
> > "real" type of the data is therefore lost.
> >
> > (I understand that this is probably due to confusion from the misnaming
> > of the "UNKNOWN" logical type, which would have been more appropriately
> > named "ALWAYS_NULL" or similar)
> >
> > > The second example involves an INT64 column with a TIMESTAMP_MILLIS
> > > converted type but a String logical type. Parquet-mr in this example
> > > falls back to the timestamp converted type whereas Parquet C++ throws an
> > > exception.
> >
> > Well, I don't know why a String logical type should be accepted for
> > integer columns with a timestamp converted type.  The fact that
> > parquet-mr accepts it sounds like a bug in parquet-mr, IMO.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>
