[ 
https://issues.apache.org/jira/browse/HIVE-17714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250153#comment-16250153
 ] 

Sergey Shelukhin edited comment on HIVE-17714 at 11/13/17 8:10 PM:
-------------------------------------------------------------------

Hmm... I was writing the below when I realized we might be missing something. 
If this is resolved, the below applies; otherwise, as far as I can tell, none 
of the suggestions above or below work.
To store the derived schema in the metastore, wouldn't we need the SerDe jar to 
be present in the first place, so that we can ask it for the schema? Otherwise, 
if we allow users to specify both the columns and an external schema, we are 
outsourcing even the initial correctness to the user, which seems wrong.
I think it's reasonable to expect that if a SerDe is used, it is available to 
the user (and to the metastore). I don't think having extra jars is a 
problem... the user needs all the jars anyway to actually query a table with 
that SerDe, right?

==== The below (without jars).

My main concern is ensuring that the schema stored in the metastore stays in 
sync with the actual schema reported by the SerDe. The two can drift apart from 
either side.
On the Hive side, columns can be added and altered even though a SerDe that is 
responsible for the schema is present (I filed a jira somewhere to block such 
modifications). These modifications are visible to users through the metastore 
APIs, but for most SerDes they are not reflected in the schema that Hive 
actually uses, which is confusing.
On the SerDe side, some SerDes support schemas in external files that we have 
no control over, and other such mechanisms could exist.
Verifying the schema at use time solves the problem for Hive but not for other 
users of the metastore, which is kind of the point: Hive already ignores the 
metastore columns for these tables and goes to the SerDe instead, so the 
mismatch is not a problem for it. 
And adding such checks to the metastore would require access to the jars, at 
which point we might as well return the correct schema.
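
To make the sync concern concrete, here is a minimal Java sketch of a 
stored-versus-SerDe comparison. Everything in it is an assumption made for 
illustration: SchemaDriftSketch, Column and diff are hypothetical names, not 
Hive or metastore classes, and the SerDe's columns are simply passed in as a 
list. In a real metastore, obtaining that list is exactly the step that needs 
the SerDe jars.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; these are not Hive or metastore classes.
final class SchemaDriftSketch {

    /** A column as a (name, type) pair; stands in for a stored field schema. */
    static final class Column {
        final String name;
        final String type;

        Column(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    /**
     * Lists the differences between the columns stored in the metastore and
     * the columns reported by the SerDe. An empty list means they agree.
     */
    static List<String> diff(List<Column> stored, List<Column> fromSerde) {
        List<String> problems = new ArrayList<>();
        if (stored.size() != fromSerde.size()) {
            problems.add("column count: stored=" + stored.size()
                    + ", serde=" + fromSerde.size());
        }
        // Compare position by position over the columns both sides have.
        int common = Math.min(stored.size(), fromSerde.size());
        for (int i = 0; i < common; i++) {
            Column s = stored.get(i);
            Column d = fromSerde.get(i);
            if (!s.name.equalsIgnoreCase(d.name)) {
                problems.add("column " + i + " name: stored=" + s.name
                        + ", serde=" + d.name);
            } else if (!s.type.equalsIgnoreCase(d.type)) {
                problems.add("column " + s.name + " type: stored=" + s.type
                        + ", serde=" + d.type);
            }
        }
        return problems;
    }
}
```

The comparison itself is trivial; the point of the sketch is that its second 
argument can only come from the SerDe, so a metastore-side check has the same 
jar requirement as returning the SerDe schema directly.
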
How about this... 
1) We remove the logic that avoids storing the schema in the metastore 
entirely, and always store the schema, like before.
2) On reads, the metastore tries to load the SerDe class and, if it is 
available, returns the schema from the SerDe, or does a check as suggested 
above.
3) We add a compat flag that breaks everyone trying to access such tables when 
the jars are absent, so that the client is required to be aware of the 
potential discrepancy. This would work like the flag added for MM tables, which 
fails getTable/etc. calls for them unless the client explicitly claims to 
support MM tables or disables compat checks. To get past the flag, a client 
either sets a config flag to disable the checks (so they know they might hit 
some rare issues) or actually implements the equivalent of 
get-from-deserializer.
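
The three steps above can be sketched as a single read-path decision. This is 
a rough Java illustration under stated assumptions, not the metastore's actual 
code: SerdeSchemaReadSketch, SerdeSchemaSource, ClientInfo and columnsForRead 
are hypothetical names, columns are plain strings, and "SerDe jar absent" is 
modeled as an empty Optional.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of steps 1-3; none of these names are real Hive or
// metastore APIs.
final class SerdeSchemaReadSketch {

    /** Stand-in for "load the SerDe class and ask it for the columns". */
    interface SerdeSchemaSource {
        /** Returns empty when the SerDe class cannot be loaded (jar absent). */
        Optional<List<String>> columnsFromSerde(String serdeClassName);
    }

    /** What a client has declared about itself (assumed shape for step 3). */
    static final class ClientInfo {
        final boolean resolvesSchemaItself; // has its own get-from-deserializer
        final boolean compatChecksDisabled; // opted out via a config flag

        ClientInfo(boolean resolvesSchemaItself, boolean compatChecksDisabled) {
            this.resolvesSchemaItself = resolvesSchemaItself;
            this.compatChecksDisabled = compatChecksDisabled;
        }
    }

    static List<String> columnsForRead(String serdeClassName,
                                       List<String> storedColumns,
                                       SerdeSchemaSource source,
                                       ClientInfo client) {
        // Step 2: prefer the SerDe's own view of the schema when it is available.
        Optional<List<String>> fromSerde = source.columnsFromSerde(serdeClassName);
        if (fromSerde.isPresent()) {
            return fromSerde.get();
        }
        // Step 3: without the jars, only clients that accepted the risk or
        // resolve the schema themselves get the stored columns (step 1
        // guarantees that stored columns always exist).
        if (client.resolvesSchemaItself || client.compatChecksDisabled) {
            return storedColumns;
        }
        throw new IllegalStateException(
                "SerDe " + serdeClassName + " is not available; stored columns"
                + " may not match the actual schema");
    }
}
```

Failing by default is the design choice that matters here: a client that has 
not declared anything gets an error rather than a possibly stale column list.
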





> move custom SerDe schema considerations into metastore from QL
> --------------------------------------------------------------
>
>                 Key: HIVE-17714
>                 URL: https://issues.apache.org/jira/browse/HIVE-17714
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Alan Gates
>
> Columns in metastore for tables that use external schema don't have the type 
> information (since HIVE-11985) and may be entirely inconsistent (since 
> forever, due to issues like HIVE-17713; or for SerDes that allow a URL for 
> the schema, due to a change in the underlying file).
> Currently, if you trace the usage of ConfVars.SERDESUSINGMETASTOREFORSCHEMA 
> through to MetaStoreUtils.getFieldsFromDeserializer, you'd see that the code 
> in QL handles this in Hive. So, for the most part, metastore just returns 
> whatever is stored for columns in the database.
> One exception appears to be get_fields_with_environment_context, which is 
> interesting... so getTable will return incorrect columns (potentially), but 
> get_fields/get_schema will return correct ones from SerDe as far as I can 
> tell.
> As part of separating the metastore, we should make sure all the APIs return 
> the correct schema for the columns; it's not a good idea to have everyone 
> reimplement getFieldsFromDeserializer.
> Note: this should also remove a flag introduced in HIVE-17731



