I know that Hive can deal with the schema being different per partition, but I struggle to understand why someone would want to do this. If someone asked me to support a mixed Avro/Parquet table, I would suggest they create a view. If they kept insisting, I would reply "Well, it's your funeral."
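For what it's worth, the view approach might look roughly like this (a minimal sketch; the table and column names are made up for illustration):

```sql
-- Hypothetical sketch: keep each format in its own single-format table
-- and expose one logical table through a UNION ALL view, instead of
-- mixing Avro and Parquet partitions inside one table.
CREATE TABLE sales_parquet (id BIGINT, amount DOUBLE) STORED AS PARQUET;
CREATE TABLE sales_avro (id BIGINT, amount DOUBLE) STORED AS AVRO;

CREATE VIEW sales AS
SELECT id, amount FROM sales_parquet
UNION ALL
SELECT id, amount FROM sales_avro;
```

Each underlying table keeps a single, consistent format, and readers only ever see the view's schema.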
On Wed, Jul 11, 2018 at 7:51 PM, Todd Lipcon <t...@cloudera.com.invalid> wrote:
> Hey folks,
>
> I'm trying to understand the current behavior of tables that contain
> partitions of mixed format, specifically when one or more partitions is
> stored as Avro. Impala seems to be doing a number of things which I find
> surprising, and I'm not sure if they are intentional or should be
> considered bugs.
>
> *Surprise 1*: the _presence_ of an Avro-formatted partition can change the
> table schema: https://gist.github.com/74bdef8a69b558763e4453ac21313649
>
> - create a table that is Parquet-formatted, but with an 'avro.schema.url'
>   property
> - the Avro schema is ignored, and we see whatever schema we specified
>   *(makes sense, because the table is Parquet)*
> - add a partition
> - set the new partition's format to Avro
> - refresh the table
> - the schema for the table now reflects the Avro schema, because it has at
>   least one Avro partition
>
> *Surprise 2*: the above is inconsistent with Hive and Spark
>
> Hive seems to still reflect the table-level defined schema, and ignore the
> avro.schema.url property in this mixed scenario. That is to say, with the
> state set up by the above, we have the following behavior:
>
> Impala:
> - uses the external Avro schema for all table-level info, SELECT *, etc.
> - "compute stats" detects the inconsistency and tells the user to recreate
>   the table
> - if some existing partitions (e.g. in Parquet) aren't compatible with that
>   Avro schema, the backend reports errors about missing columns in the
>   Parquet data files
>
> Hive:
> - uses the table-level schema defined in the HMS for DESCRIBE, etc.
> - queries like SELECT * again use the table-level HMS schema. The
>   underlying reader that reads the Avro partition seems to use the defined
>   external Avro schema, resulting in nulls for missing columns.
> - computing stats (analyze table mixedtable partition (y=1) compute stats
>   for columns) seems to end up only recording stats against the columns
>   defined in the table-level schema.
>
> Spark:
> - DESCRIBE TABLE shows the table-level info
> - SELECT * fails, because apparently Spark doesn't support multi-format
>   tables at all (it tries to read the Avro files as Parquet files)
>
> It seems to me that Hive's behavior is a bit better. *I'd like to propose
> we treat this as a bug and move to the following behavior:*
>
> - if a table's properties indicate it's an Avro table, parse and adopt the
>   external Avro schema as the table schema
> - if a table's properties indicate it's _not_ an Avro table, but there is
>   an external Avro schema defined in the table properties, then parse the
>   Avro schema and include it in the TableDescriptor (for use by Avro
>   partitions) but do not adopt it as the table schema.
>
> The added benefit of the above proposal (and the reason I started looking
> into this in the first place) is that, in order to service a simple query
> like DESCRIBE, our current behavior requires all partition metadata to be
> loaded just to know whether there is any Avro-formatted partition. With
> the proposed new behavior, we can avoid looking at partitions entirely.
> This is important for any metadata design that supports fine-grained
> loading of metadata to the coordinator.
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
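For anyone who doesn't want to click through to the gist, the repro Todd describes in Surprise 1 can be sketched roughly as follows (names and the schema URL are hypothetical; the gist has the exact statements):

```sql
-- Rough sketch of the Surprise 1 repro. Table name, columns, and the
-- avro.schema.url value are made up for illustration.
CREATE TABLE mixedtable (x INT) PARTITIONED BY (y INT)
  STORED AS PARQUET
  TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/mixedtable.avsc');
-- At this point DESCRIBE shows the declared (x, y) schema; the external
-- Avro schema is ignored because the table is Parquet.

ALTER TABLE mixedtable ADD PARTITION (y=1);
ALTER TABLE mixedtable PARTITION (y=1) SET FILEFORMAT AVRO;

-- After refreshing, Impala's reported table schema flips to the external
-- Avro schema, since the table now has at least one Avro partition.
REFRESH mixedtable;
DESCRIBE mixedtable;
```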