Re: Inconsistent handling of schema in Avro tables

Bharath Vissapragada Thu, 12 Jul 2018 17:08:58 -0700

On Thu, Jul 12, 2018 at 12:03 PM Todd Lipcon <[email protected]>
wrote:


> Again there's inconsistency with Hive: the presence of a single Avro
> partition doesn't change the table-level schema.
>
> The interesting thing is that, when I modified Impala to have a similar
> behavior, I got the following error from the backend when trying to query
> the data:
>
> WARNINGS: Unresolvable types for column 'tinyint_col': declared column
> type: TINYINT, table's Avro schema type: int (1 of 2 similar)
>
> Hive, however, when presented with such a situation, does an incorrect type
> coercion of the avro file's "int" into the table's "tinyint" schema. What's
> incorrect about the type coercion is that it doesn't handle out-of-range
> values in a sensible way. Rather, it just lets them overflow into the
> smaller target type -- when I dumped an Avro file containing the int
> "10000" into an Avro partition of a Parquet table with schema "tinyint", I
> got the value "16" output from Hive (10000%256=16). This downcasting
> behavior is consistent with Impala and Hive's behavior with a CAST(10000 as
> tinyint) expression (see IMPALA-1821)
>
>
> So, I think my proposal here is:
>
> 1. Query behavior on existing tables
> - If the table-level format is non-Avro,
> - AND the table contains column types incompatible with Avro (eg tinyint),
> - AND the table has an existing avro partition,
> - THEN the query will yield an error about incompatible types
>
> 2. Try to prevent shooting in the foot
> - If the table-level format is non-Avro,
> - AND the table contains column types incompatible with Avro (eg tinyint),
> - THEN disallow changing the file format of an existing partition to Avro
>

This approach sounds good to me. Is the existing behavior preserved for the
non-mixed-partition Avro based tables?


>
>
> -Todd
>
>
>
>
>
>
> On Wed, Jul 11, 2018 at 9:32 PM, Todd Lipcon <[email protected]> wrote:
>
> > Turns out it's even a bit more messy. The presence of one or more avro
> > partitions can change the types of existing columns, even if there is no
> > explicit avro schema specified for the table:
> > https://gist.github.com/5018d6ff50f846c72762319eb7cf5ca8
> >
> > Not quite sure how to handle this one in a world where we don't load all
> > of the partitions up front. Perhaps the best approach is to just throw an
> > error and then provide a command for the user to "re-sync" the schema to
> > the appropriate avro-supported types? Hive provides ALTER TABLE <foo>
> > UPDATE COLUMNS for something like this, though still I don't think that
> > would iterate over all partitions in the case of a mixed table.
>

Ouch, this is bad. Throwing error sounds reasonable to me, if the types are
not Avro compatible.


> >
> > -Todd
> >
> > On Wed, Jul 11, 2018 at 9:03 PM, Bharath Vissapragada <
> > [email protected]> wrote:
> >
> >> Agreed.
> >>
> >> On Wed, Jul 11, 2018 at 8:55 PM Todd Lipcon <[email protected]>
> >> wrote:
> >>
> >> > Your commit message there makes sense, Bharath -- we should set
> >> > 'avroSchema' in the descriptor in case any referenced partition is
> avro,
> >> > because the scanner needs that info. However, we don't need to also
> >> > override the table-level schema. So, I think we can preserve the fix
> >> that
> >> > you made while also making the behavior less surprising.
> >> >
> >> > -Todd
> >> >
> >> > On Wed, Jul 11, 2018 at 8:21 PM, Bharath Vissapragada <
> >> > [email protected]> wrote:
> >> >
> >> > > I added this functionality
> >> > > <https://github.com/apache/impala/commit/49610e2cfa40aa10b62
> >> 6c5ae41d7f0
> >> > > d99d7cabc5>
> >> > >  where adding an Avro partition in a mixed partition table resets
> the
> >> > table
> >> > > level schema. While I don't exactly remember why we chose this path,
> >> I do
> >> > > recall that we debated quite a bit about Avro schema evolution
> causing
> >> > > schema inconsistencies across partitions. AFAICT there is no
> specific
> >> > > reason Impala chose to different from Hive. Now that I see your
> email,
> >> > > Hive's behavior makes more sense to me, especially in the context of
> >> lazy
> >> > > loading of metadata.
> >> > >
> >> > > Also, agree with Edward that the whole mixed partitions + Avro
> schema
> >> > > evolution is a mess and I doubt if any serious user relies on a
> >> specific
> >> > > behavior.
> >> > >
> >> > > On Wed, Jul 11, 2018 at 7:48 PM Edward Capriolo <
> >> [email protected]>
> >> > > wrote:
> >> > >
> >> > > > I know that Hive can deal with schema being different per
> partition,
> >> > but
> >> > > I
> >> > > > really hesitate to understand why someone would want to do this.
> If
> >> > > someone
> >> > > > asked me to support a mixed avro/parquet table I would suggest
> they
> >> > > create
> >> > > > a view. If they kept insisting I would reply "Well it is your
> >> funeral."
> >> > > >
> >> > > > On Wed, Jul 11, 2018 at 7:51 PM, Todd Lipcon
> >> <[email protected]
> >> > >
> >> > > > wrote:
> >> > > >
> >> > > > > Hey folks,
> >> > > > >
> >> > > > > I'm trying to understand the current behavior of tables that
> >> contain
> >> > > > > partitions of mixed format, specifically when one or more
> >> partitions
> >> > is
> >> > > > > stored as Avro. Impala seems to be doing a number of things
> which
> >> I
> >> > > find
> >> > > > > surprising, and I'm not sure if they are intentional or should
> be
> >> > > > > considered bugs.
> >> > > > >
> >> > > > > *Surprise 1*: the _presence_ of an Avro-formatted partition can
> >> > change
> >> > > > the
> >> > > > > table schema
> >> > > > > https://gist.github.com/74bdef8a69b558763e4453ac21313649
> >> > > > >
> >> > > > > - create a table that is Parquet-formatted, but with an
> >> > > 'avro.schema.url'
> >> > > > > property
> >> > > > > - the Avro schema is ignored, and we see whatever schema we
> >> specified
> >> > > > > (*makes
> >> > > > > sense, because the table is Parquet)*
> >> > > > > - add an partition
> >> > > > > - set the new partition's format to Avro
> >> > > > > - refresh the table
> >> > > > > - the schema for the table now reflects the Avro schema, because
> >> it
> >> > has
> >> > > > at
> >> > > > > least one Avro partition
> >> > > > >
> >> > > > > *Surprise 2*: the above is inconsistent with Hive and Spark
> >> > > > >
> >> > > > > Hive seems to still reflect the table-level defined schema, and
> >> > ignore
> >> > > > the
> >> > > > > avro.schema.url property in this mixed scenario. That is to say,
> >> with
> >> > > the
> >> > > > > state set up by the above, we have the following behavior:
> >> > > > >
> >> > > > > Impala:
> >> > > > > - uses the external avro schema for all table-level info, SELECT
> >> *,
> >> > > etc.
> >> > > > > - "compute stats" detects the inconsistency and tells the user
> to
> >> > > > recreate
> >> > > > > the table.
> >> > > > > - if some existing partitions (eg in Parquet) aren't compatible
> >> with
> >> > > that
> >> > > > > avro schema, errors result from the backend that there are
> missing
> >> > > > columns
> >> > > > > in the Parquet data files
> >> > > > >
> >> > > > > Hive:
> >> > > > > - uses the table-level schema defined in the HMS for describe,
> etc
> >> > > > > - queries like 'select *' again use the table-level HMS schema.
> >> The
> >> > > > > underlying reader that reads the Avro partition seems to use the
> >> > > defined
> >> > > > > external Avro schema, resulting in nulls for missing columns.
> >> > > > > - computing stats (analyze table mixedtable partition (y=1)
> >> compute
> >> > > stats
> >> > > > > for columns) seems to end up only recording stats against the
> >> column
> >> > > > > defined in the table-level Schema.
> >> > > > >
> >> > > > > Spark:
> >> > > > > - DESCRIBE TABLE shows the table-level info
> >> > > > > - select * fails, because apparently Spark doesn't support
> >> > multi-format
> >> > > > > tables at all (it tries to read the avro files as a parquet
> file)
> >> > > > >
> >> > > > >
> >> > > > > It seems to me that Hive's behavior is a bit better.* I'd like
> to
> >> > > propose
> >> > > > > we treat this as a bug and move to the following behavior:*
> >> > > > >
> >> > > > > - if a table's properties indicate it's an avro table, parse and
> >> > adopt
> >> > > > the
> >> > > > > external avro schema as the table schema
> >> > > > > - if a table's properties indicate it's _not_ an avro table, but
> >> > there
> >> > > is
> >> > > > > an external avro schema defined in the table properties, then
> >> parse
> >> > the
> >> > > > > avro schema and include it in the TableDescriptor (for use by
> avro
> >> > > > > partitions) but do not adopt it as the table schema.
> >> > > > >
> >> > > > > The added benefit of the above proposal (and the reason why I
> >> started
> >> > > > > looking into this in the first place) is that, in order to
> >> service a
> >> > > > simple
> >> > > > > query like DESCRIBE, our current behavior requires all partition
> >> > > metadata
> >> > > > > to be loaded to know whether there is any avro-formatted
> >> partition.
> >> > > With
> >> > > > > the proposed new behavior, we can avoid looking at all
> partitions.
> >> > This
> >> > > > is
> >> > > > > important for any metadata design which supports fine-grained
> >> loading
> >> > > of
> >> > > > > metadata to the coordinator.
> >> > > > >
> >> > > > > -Todd
> >> > > > > --
> >> > > > > Todd Lipcon
> >> > > > > Software Engineer, Cloudera
> >> > > > >
> >> > > >
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Todd Lipcon
> >> > Software Engineer, Cloudera
> >> >
> >>
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Inconsistent handling of schema in Avro tables

Reply via email to