You might also consider looking for fallback options. For instance, in 
https://github.com/apache/parquet-mr/pull/957, I figured out a good spot to 
catch the exception and then fall back to a converted schema.
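
As a rough illustration of that kind of fallback (this is not the actual 
parquet-mr or Parquet C++ code; every class, function, and flag name below is 
hypothetical), the logic might look something like this in Python. The strict 
path rejects an UNKNOWN logical type on a group node, and the fallback path 
uses the converted type only when the caller opts in:

```python
from dataclasses import dataclass
from typing import Optional


class InvalidLogicalType(Exception):
    """Raised when a logical type annotation is not valid for a node."""


@dataclass
class SchemaNode:
    # Hypothetical stand-in for a Parquet schema element, not a real library type.
    name: str
    is_group: bool
    converted_type: Optional[str] = None
    logical_type: Optional[str] = None


def resolve_from_logical_type(node: SchemaNode) -> Optional[str]:
    # Strict path: refuse UNKNOWN on a group node, mirroring the rejection
    # described in this thread.
    if node.is_group and node.logical_type == "UNKNOWN":
        raise InvalidLogicalType(
            f"UNKNOWN logical type is not valid on group '{node.name}'"
        )
    return node.logical_type


def resolve_annotation(node: SchemaNode, allow_fallback: bool = False) -> Optional[str]:
    # allow_fallback is an assumed opt-in flag, off by default.
    try:
        return resolve_from_logical_type(node)
    except InvalidLogicalType:
        # Fall back to the converted type only when asked to and when one exists;
        # otherwise keep the strict behaviour and re-raise.
        if allow_fallback and node.converted_type is not None:
            return node.converted_type
        raise
```

With allow_fallback=True, a group annotated as (MAP_KEY_VALUE, UNKNOWN) would 
resolve to MAP_KEY_VALUE; with the default, the exception still propagates, so 
strict readers keep their current behaviour.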

On 5/29/22, 1:53 PM, "Micah Kornfield" <[email protected]> wrote:

    I'd be in favor of adding a flag, false by default, that allows this type
    of schema.  Alternatively, if we can identify the writer of these files,
    could we make an exception specifically for those versions?

    On Wed, May 18, 2022 at 4:02 PM William Butler <[email protected]>
    wrote:

    > >
    > > Well, why is UNKNOWN used here? This seems like a bug in the producer:
    >
    > It is a bug in the producer, see the JIRA, but now there are a lot of these
    > files out there in the wild. Adding an explicit workaround might be
    > reasonable given the pervasiveness.
    >
    > On Mon, May 16, 2022 at 4:57 AM Antoine Pitrou <[email protected]> wrote:
    >
    > > On Thu, 12 May 2022 09:46:57 -0700
    > > William Butler <[email protected]>
    > > wrote:
    > > >
    > > > From the JIRA, the converted type looks something like
    > > >
    > > >   required group FeatureAmounts (MAP) {
    > > >     repeated group map (MAP_KEY_VALUE) {
    > > >       required binary key (STRING);
    > > >       required binary key (STRING);
    > > >     }
    > > >   }
    > > >
    > > >
    > > > but the logical type looks like
    > > >
    > > >   required group FeatureAmounts (MAP) {
    > > >     repeated group map (UNKNOWN) {
    > > >       required binary key (STRING);
    > > >       required binary key (STRING);
    > > >     }
    > > >   }
    > > >
    > > > Parquet C++ does not like that the UNKNOWN/NullLogicalType is being used
    > > > in the groups and rejects the schema with an exception.
    > >
    > > Well, why is UNKNOWN used here? This seems like a bug in the producer:
    > > if MAP_KEY_VALUE does not have an equivalent logical type, then no
    > > logical type annotation should be produced, instead of the "UNKNOWN"
    > > logical type annotation which means that all values are null and the
    > > "real" type of the data is therefore lost.
    > >
    > > (I understand that this is probably due to confusion from the misnaming
    > > of the "UNKNOWN" logical type, which would have been more appropriately
    > > named "ALWAYS_NULL" or similar)
    > >
    > > > The second example involves an INT64 column with a TIMESTAMP_MILLIS
    > > > converted type but a String logical type. Parquet-mr in this example
    > > > falls back to the timestamp converted type whereas Parquet C++ throws an
    > > > exception.
    > >
    > > Well, I don't know why a String logical type should be accepted for
    > > integer columns with a timestamp converted type.  The fact that
    > > parquet-mr accepts it sounds like a bug in parquet-mr, IMO.
    > >
    > > Regards
    > >
    > > Antoine.
    > >
    > >
    > >
    >
