[ https://issues.apache.org/jira/browse/HIVE-17714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250153#comment-16250153 ]
Sergey Shelukhin edited comment on HIVE-17714 at 11/13/17 8:10 PM:
-------------------------------------------------------------------
Hmm... I was writing the below when I realized something we might be missing.
If this is resolved, the below applies; otherwise none of the above or below
suggestions work, as far as I can tell.
In order to store the derived schema in metastore, wouldn't we need the serde
jar to be present in the first place, to ask it for the schema? Otherwise, if
we allow users to specify both columns and external schema, we are outsourcing
even the initial correctness, which seems wrong.
I think it's reasonable to expect that if a SerDe is used, it should be
available to the user (and metastore). I don't think having extra jars is a
problem... the user will have to have all the jars anyway to actually query the
table with the SerDe, right?
==== The below (without jars).
My main concern is ensuring that the schema stored in metastore stays in sync
with the actual schema reported by the serde. These can get out of sync from
both sides.
Hive columns can be added and altered even though a serde that is responsible
for the schema is present (I filed a jira somewhere to block modifications like
this). These modifications will be visible to users (because of the metastore
APIs); for most serdes, however, they won't be reflected in the schema that
Hive will actually use, which is confusing.
Some serdes also support schemas in external files that we have no control
over, and other such mechanisms could exist.
Verifying schema at use time solves the problem for Hive, but not for other
users of the metastore, which is kind of the point: Hive already ignores
metastore columns for these tables, going instead to the SerDe, so the mismatch
is not a problem for it.
And adding such checks in metastore would mean needing access to the jars, at
which point we might as well return the correct schema.
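To make the kind of check being discussed concrete, here is a rough sketch of
comparing the columns stored in metastore against the columns a serde reports.
Column and SchemaCheck are hypothetical stand-ins made up for illustration;
they are not Hive's FieldSchema or any existing metastore API, and the
comparison rules (case-insensitive names, exact type strings) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for a metastore column (name + type string).
final class Column {
    final String name;
    final String type;

    Column(String name, String type) {
        this.name = name;
        this.type = type;
    }
}

// Hypothetical helper; compares columns positionally.
final class SchemaCheck {
    // Returns one human-readable entry per difference; empty means "in sync".
    static List<String> diff(List<Column> stored, List<Column> fromSerDe) {
        List<String> problems = new ArrayList<>();
        if (stored.size() != fromSerDe.size()) {
            problems.add("column count: stored=" + stored.size()
                    + ", serde=" + fromSerDe.size());
        }
        // Compare only the positions present on both sides.
        int n = Math.min(stored.size(), fromSerDe.size());
        for (int i = 0; i < n; i++) {
            Column s = stored.get(i);
            Column d = fromSerDe.get(i);
            if (!s.name.equalsIgnoreCase(d.name)) {
                problems.add("column " + i + " name: stored=" + s.name
                        + ", serde=" + d.name);
            } else if (!Objects.equals(s.type, d.type)) {
                problems.add("column " + s.name + " type: stored=" + s.type
                        + ", serde=" + d.type);
            }
        }
        return problems;
    }
}
```

Note that the serde-side list still has to come from somewhere, which is the
jar-availability problem again.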
How about this...
1) We can remove the logic that avoids storing schema in metastore entirely,
and always store the schema, like before.
2) Metastore will try to get SerDe class on reads, and if available, will
return the schema from SerDe, or do a check as suggested above.
3) We could add a compat flag, like the one added for MM tables (which fails
getTable/etc. calls for them unless the client explicitly claims to support MM
tables or disables compat checks). It would break everyone trying to access
such tables when the jars are absent, so the client is required to be aware of
the potential discrepancy. Clients could avoid the failure by setting a config
flag to disable the checks (so they know they might hit some rare issues), or
by actually implementing the equivalent of get-from-deserializer.
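Steps 2 and 3 could look roughly like the sketch below. Everything here is
made up for illustration: SchemaOnRead, resolveColumns, the schemaFromSerDe
callback (standing in for whatever get-from-deserializer would be on the
metastore side) and the clientDisabledCompatCheck flag are hypothetical names,
not existing Hive code or configuration.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of the proposed read path; not an existing Hive class.
final class SchemaOnRead {
    static List<String> resolveColumns(
            String serdeClassName,
            List<String> storedColumns,
            Function<Class<?>, List<String>> schemaFromSerDe,
            boolean clientDisabledCompatCheck) {
        Class<?> serdeClass;
        try {
            // initialize=false: only locate the class, do not run its static init.
            serdeClass = Class.forName(
                    serdeClassName, false, SchemaOnRead.class.getClassLoader());
        } catch (ClassNotFoundException e) {
            serdeClass = null; // jar is not on the metastore classpath
        }
        if (serdeClass != null) {
            // Step 2: SerDe is available, so its schema wins over stored columns.
            return schemaFromSerDe.apply(serdeClass);
        }
        if (clientDisabledCompatCheck) {
            // Step 3 opt-out: client accepts that stored columns may be stale.
            return storedColumns;
        }
        // Step 3 default: fail rather than silently return a possibly wrong schema.
        throw new IllegalStateException(
                "SerDe " + serdeClassName + " is not available; stored columns may be stale");
    }
}
```

The point of failing by default is that a client only sees possibly stale
columns after explicitly opting in.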
> move custom SerDe schema considerations into metastore from QL
> --------------------------------------------------------------
>
> Key: HIVE-17714
> URL: https://issues.apache.org/jira/browse/HIVE-17714
> Project: Hive
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Alan Gates
>
> Columns in metastore for tables that use external schema don't have the type
> information (since HIVE-11985) and may be entirely inconsistent (since
> forever, due to issues like HIVE-17713; or for SerDes that allow a URL for
> the schema, due to a change in the underlying file).
> Currently, if you trace the usage of ConfVars.SERDESUSINGMETASTOREFORSCHEMA
> through to MetaStoreUtils.getFieldsFromDeserializer, you'd see that the code
> in QL handles this in Hive. So, for the most part, metastore just returns
> whatever is stored for columns in the database.
> One exception appears to be get_fields_with_environment_context, which is
> interesting... so getTable will return incorrect columns (potentially), but
> get_fields/get_schema will return correct ones from SerDe as far as I can
> tell.
> As part of separating the metastore, we should make sure all the APIs return
> the correct schema for the columns; it's not a good idea to have everyone
> reimplement getFieldsFromDeserializer.
> Note: this should also remove a flag introduced in HIVE-17731.