I know that Hive can deal with the schema being different per partition, but I struggle to understand why someone would want to do this. If someone asked me to support a mixed Avro/Parquet table, I would suggest they create a view. If they kept insisting, I would reply "Well, it's your funeral."
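For what it's worth, the view approach might look roughly like this (a minimal sketch; the table and column names are made up for illustration):

```sql
-- Hypothetical sketch: keep each format in its own single-format table
-- and expose one logical table through a UNION ALL view, instead of
-- mixing Avro and Parquet partitions inside one table.
CREATE TABLE sales_parquet (id BIGINT, amount DOUBLE) STORED AS PARQUET;
CREATE TABLE sales_avro (id BIGINT, amount DOUBLE) STORED AS AVRO;

CREATE VIEW sales AS
SELECT id, amount FROM sales_parquet
UNION ALL
SELECT id, amount FROM sales_avro;
```

Each underlying table keeps a single, consistent format, and readers only ever see the view's schema.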
On Wed, Jul 11, 2018 at 7:51 PM, Todd Lipcon <t...@cloudera.com.invalid> wrote:
> Hey folks,
>
> I'm trying to understand the current behavior of tables that contain
> partitions of mixed format, specifically when one or more partitions is
> stored as Avro. Impala seems to be doing a number of things which I find
> surprising, and I'm not sure if they are intentional or should be
> considered bugs.
>
> *Surprise 1*: the _presence_ of an Avro-formatted partition can change the
> table schema: https://gist.github.com/74bdef8a69b558763e4453ac21313649
>
> - create a table that is Parquet-formatted, but with an 'avro.schema.url'
>   property
> - the Avro schema is ignored, and we see whatever schema we specified
>   *(makes sense, because the table is Parquet)*
> - add a partition
> - set the new partition's format to Avro
> - refresh the table
> - the schema for the table now reflects the Avro schema, because it has at
>   least one Avro partition
>
> *Surprise 2*: the above is inconsistent with Hive and Spark
>
> Hive seems to still reflect the table-level defined schema, and ignore the
> avro.schema.url property in this mixed scenario. That is to say, with the
> state set up by the above, we have the following behavior:
>
> Impala:
> - uses the external Avro schema for all table-level info, SELECT *, etc.
> - "compute stats" detects the inconsistency and tells the user to recreate
>   the table
> - if some existing partitions (e.g. in Parquet) aren't compatible with that
>   Avro schema, the backend reports errors about missing columns in the
>   Parquet data files
>
> Hive:
> - uses the table-level schema defined in the HMS for DESCRIBE, etc.
> - queries like SELECT * again use the table-level HMS schema. The
>   underlying reader that reads the Avro partition seems to use the defined
>   external Avro schema, resulting in nulls for missing columns.
> - computing stats (analyze table mixedtable partition (y=1) compute stats
>   for columns) seems to end up only recording stats against the columns
>   defined in the table-level schema.
>
> Spark:
> - DESCRIBE TABLE shows the table-level info
> - SELECT * fails, because apparently Spark doesn't support multi-format
>   tables at all (it tries to read the Avro files as Parquet files)
>
> It seems to me that Hive's behavior is a bit better. *I'd like to propose
> we treat this as a bug and move to the following behavior:*
>
> - if a table's properties indicate it's an Avro table, parse and adopt the
>   external Avro schema as the table schema
> - if a table's properties indicate it's _not_ an Avro table, but there is
>   an external Avro schema defined in the table properties, then parse the
>   Avro schema and include it in the TableDescriptor (for use by Avro
>   partitions) but do not adopt it as the table schema.
>
> The added benefit of the above proposal (and the reason I started looking
> into this in the first place) is that, in order to service a simple query
> like DESCRIBE, our current behavior requires all partition metadata to be
> loaded just to know whether there is any Avro-formatted partition. With
> the proposed new behavior, we can avoid looking at partitions entirely.
> This is important for any metadata design that supports fine-grained
> loading of metadata to the coordinator.
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
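For anyone who doesn't want to click through to the gist, the repro Todd describes in Surprise 1 can be sketched roughly as follows (names and the schema URL are hypothetical; the gist has the exact statements):

```sql
-- Rough sketch of the Surprise 1 repro. Table name, columns, and the
-- avro.schema.url value are made up for illustration.
CREATE TABLE mixedtable (x INT) PARTITIONED BY (y INT)
  STORED AS PARQUET
  TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/mixedtable.avsc');
-- At this point DESCRIBE shows the declared (x, y) schema; the external
-- Avro schema is ignored because the table is Parquet.

ALTER TABLE mixedtable ADD PARTITION (y=1);
ALTER TABLE mixedtable PARTITION (y=1) SET FILEFORMAT AVRO;

-- After refreshing, Impala's reported table schema flips to the external
-- Avro schema, since the table now has at least one Avro partition.
REFRESH mixedtable;
DESCRIBE mixedtable;
```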