Re: Inconsistent handling of schema in Avro tables

Bharath Vissapragada Wed, 11 Jul 2018 20:23:41 -0700

I added this functionality
<https://github.com/apache/impala/commit/49610e2cfa40aa10b626c5ae41d7f0d99d7cabc5>
 where adding an Avro partition in a mixed partition table resets the table
level schema. While I don't exactly remember why we chose this path, I do
recall that we debated quite a bit about Avro schema evolution causing
schema inconsistencies across partitions. AFAICT there is no specific
reason Impala chose to differ from Hive. Now that I see your email,
Hive's behavior makes more sense to me, especially in the context of lazy
loading of metadata.


Also, I agree with Edward that the whole mixed partitions + Avro schema
evolution is a mess, and I doubt that any serious user relies on a specific
behavior.

On Wed, Jul 11, 2018 at 7:48 PM Edward Capriolo <[email protected]>
wrote:

> I know that Hive can deal with schema being different per partition, but I
> really hesitate to understand why someone would want to do this. If someone
> asked me to support a mixed avro/parquet table I would suggest they create
> a view. If they kept insisting I would reply "Well it is your funeral."
>
> On Wed, Jul 11, 2018 at 7:51 PM, Todd Lipcon <[email protected]>
> wrote:
>
> > Hey folks,
> >
> > I'm trying to understand the current behavior of tables that contain
> > partitions of mixed format, specifically when one or more partitions is
> > stored as Avro. Impala seems to be doing a number of things which I find
> > surprising, and I'm not sure if they are intentional or should be
> > considered bugs.
> >
> > *Surprise 1*: the _presence_ of an Avro-formatted partition can change the
> > table schema
> > https://gist.github.com/74bdef8a69b558763e4453ac21313649
> >
> > - create a table that is Parquet-formatted, but with an 'avro.schema.url'
> > property
> > - the Avro schema is ignored, and we see whatever schema we specified
> > (*makes sense, because the table is Parquet*)
> > - add a partition
> > - set the new partition's format to Avro
> > - refresh the table
> > - the schema for the table now reflects the Avro schema, because it has at
> > least one Avro partition
> >
> > *Surprise 2*: the above is inconsistent with Hive and Spark
> >
> > Hive seems to still reflect the table-level defined schema, and ignore the
> > avro.schema.url property in this mixed scenario. That is to say, with the
> > state set up by the above, we have the following behavior:
> >
> > Impala:
> > - uses the external avro schema for all table-level info, SELECT *, etc.
> > - "compute stats" detects the inconsistency and tells the user to recreate
> > the table.
> > - if some existing partitions (eg in Parquet) aren't compatible with that
> > avro schema, errors result from the backend that there are missing columns
> > in the Parquet data files
> >
> > Hive:
> > - uses the table-level schema defined in the HMS for describe, etc
> > - queries like 'select *' again use the table-level HMS schema. The
> > underlying reader that reads the Avro partition seems to use the defined
> > external Avro schema, resulting in nulls for missing columns.
> > - computing stats (analyze table mixedtable partition (y=1) compute stats
> > for columns) seems to end up only recording stats against the columns
> > defined in the table-level schema.
> >
> > Spark:
> > - DESCRIBE TABLE shows the table-level info
> > - select * fails, because apparently Spark doesn't support multi-format
> > tables at all (it tries to read the avro files as parquet files)
> >
> >
> > It seems to me that Hive's behavior is a bit better. *I'd like to propose
> > we treat this as a bug and move to the following behavior:*
> >
> > - if a table's properties indicate it's an avro table, parse and adopt the
> > external avro schema as the table schema
> > - if a table's properties indicate it's _not_ an avro table, but there is
> > an external avro schema defined in the table properties, then parse the
> > avro schema and include it in the TableDescriptor (for use by avro
> > partitions) but do not adopt it as the table schema.
> >
> > The added benefit of the above proposal (and the reason why I started
> > looking into this in the first place) is that, in order to service a simple
> > query like DESCRIBE, our current behavior requires all partition metadata
> > to be loaded to know whether there is any avro-formatted partition. With
> > the proposed new behavior, we can avoid looking at all partitions. This is
> > important for any metadata design which supports fine-grained loading of
> > metadata to the coordinator.
> >
> > -Todd
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>
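
To make the difference between the current and the proposed behavior in the quoted message concrete, here is a minimal Python sketch. It is only an illustration of the rule as described in this thread, not Impala's catalog code: `Table`, `Partition`, `current_schema` and `proposed_schema` are hypothetical names, and column lists stand in for real schemas.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# All names here are hypothetical and exist only for this illustration;
# they are not Impala's catalog classes.


@dataclass
class Partition:
    file_format: str  # e.g. "PARQUET" or "AVRO"


@dataclass
class Table:
    file_format: str                          # table-level format from the HMS
    hms_columns: List[str]                    # columns declared at the table level
    avro_columns: Optional[List[str]] = None  # parsed from avro.schema.url, if set
    partitions: List[Partition] = field(default_factory=list)


def current_schema(table: Table) -> List[str]:
    """Behavior described in the thread: one Avro partition is enough to
    switch the table schema, so every partition has to be inspected."""
    has_avro = table.file_format == "AVRO" or any(
        p.file_format == "AVRO" for p in table.partitions
    )
    if has_avro and table.avro_columns is not None:
        return table.avro_columns
    return table.hms_columns


def proposed_schema(table: Table) -> List[str]:
    """Proposed behavior: only table-level properties decide the schema.
    avro_columns stays on the table for Avro partitions to use, but it is
    adopted as the table schema only when the table itself is Avro."""
    if table.file_format == "AVRO" and table.avro_columns is not None:
        return table.avro_columns
    return table.hms_columns


# A Parquet table with an external Avro schema and one Avro partition.
mixed = Table(
    file_format="PARQUET",
    hms_columns=["id", "name"],
    avro_columns=["id", "name", "extra"],
    partitions=[Partition("PARQUET"), Partition("AVRO")],
)

print(current_schema(mixed))   # Avro columns, because one partition is Avro
print(proposed_schema(mixed))  # table-level columns, partitions never touched
```

Note that `proposed_schema` never reads `table.partitions`, which is the property that matters for fine-grained metadata loading: a DESCRIBE could be answered from table-level metadata alone.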
