[
https://issues.apache.org/jira/browse/HIVE-17714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248970#comment-16248970
]
Vihang Karajgaonkar commented on HIVE-17714:
--------------------------------------------
I looked into this a bit more and followed the history of the changes to SerDes
related to this. Initially, I thought of moving the Serializer, Deserializer and
AbstractSerDe classes to storage-api. This turned out to be pretty
straightforward, with no backward compatibility implications, since the package
names of the moved classes remain the same.
However, this may not solve the problem entirely, because the standalone
metastore JVM will still need these jars on its classpath to instantiate the
Deserializer and get the schema from it at runtime. SerDe implementations are
spread all over the code, and I am afraid that bringing in one jar will bring in
the rest of the world in terms of dependencies. This is probably not an issue
in the embedded mode of the metastore, because the metastore resides in the HS2
process and has access to all the Hive jars anyway, but for a remote standalone
metastore it doesn't make sense to add all these jars to the classpath at
runtime.
I also was a bit confused by this [line of code here |
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java#L980
] in {{Table.java}}, which says that any SerDe that is a subclass of
{{AbstractSerDe}} should store its field information in the metastore, while
{{AbstractSerDe}} itself returns {{false}} from {{shouldStoreFieldsInMetastore}}.
The two are contradictory.
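To make the mismatch concrete, here is a minimal sketch. The class names are hypothetical stand-ins, not the real Hive classes, and the caller-side rule is only my paraphrase of what the linked line appears to do.

```java
public class SerDeContradictionSketch {
    /** Hypothetical stand-in for Hive's AbstractSerDe; not the real class. */
    static abstract class SketchAbstractSerDe {
        // Default described above: the base class answers false.
        public boolean shouldStoreFieldsInMetastore() {
            return false;
        }
    }

    /** A SerDe that does not override the default. */
    static class PlainSerDe extends SketchAbstractSerDe { }

    // Caller-side rule as paraphrased above (an assumption about the intent):
    // any subclass of the abstract class is treated as storing fields in metastore.
    static boolean callerAssumesFieldsInMetastore(Object serde) {
        return serde instanceof SketchAbstractSerDe;
    }

    public static void main(String[] args) {
        SketchAbstractSerDe serde = new PlainSerDe();
        // The two views of the same SerDe disagree.
        System.out.println("caller view: " + callerAssumesFieldsInMetastore(serde));
        System.out.println("serde view:  " + serde.shouldStoreFieldsInMetastore());
    }
}
```

With an unmodified subclass, the caller-side view is true while the SerDe's own answer is false.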
Based on what I have looked at so far, there is no easy way to solve this and
HIVE-17580 consistently for all the use cases without breaking backwards
compatibility. I propose we make the following changes:
1. Change {{AbstractSerDe:shouldStoreFieldsInMetastore}} to return {{true}}.
It already behaves as if it's true, based on what we see in Table.java above. We
then claim that all the SerDe implementations that extend AbstractSerDe will
store their schema in the metastore unless explicitly overridden to return
false. This should cover all the SerDes in the Hive source code, since
HIVE-15167 moved them to subclass AbstractSerDe instead of directly implementing
the interfaces.
2. Move the Serializer, Deserializer and AbstractSerDe classes to storage-api.
This enables the metastore to consume them without having to create a
compile-time dependency on Hive.
3. We claim that users who implement the Serializer/Deserializer interfaces
directly and still want the metastore to store the schema for them should make
sure that their jar can be added to the classpath of the standalone metastore.
The metastore will then use the existing mechanism to load and deserialize from
the SerDe class.
4. Add a check in {{HiveMetaStoreUtils.getFieldsFromDeserializer}} that throws
an exception, before trying to use the deserializer to get the schema, if the
implementation of {{shouldStoreFieldsInMetastore}} returns false. I don't think
the metastore can ever be 100% sure whether a SerDe declares that its fields are
not supposed to be stored in the metastore.
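A rough sketch of how items 1 and 4 could fit together, under one literal reading of item 4. Everything here is a simplified stand-in: the types, the field-name list, and the exception type are assumptions for illustration, and the real {{getFieldsFromDeserializer}} signature is different.

```java
import java.util.Arrays;
import java.util.List;

public class MetastoreSchemaSketch {
    /** Hypothetical, heavily simplified stand-in for the Deserializer interface. */
    interface SketchDeserializer {
        List<String> getFieldNames();
    }

    /** Hypothetical stand-in for AbstractSerDe with the default proposed in item 1. */
    static abstract class SketchAbstractSerDe implements SketchDeserializer {
        public boolean shouldStoreFieldsInMetastore() {
            return true; // item 1: default flips from false to true
        }
    }

    /** Keeps the default, so its schema lives in the metastore. */
    static class MetastoreBackedSerDe extends SketchAbstractSerDe {
        @Override
        public List<String> getFieldNames() {
            return Arrays.asList("id", "name");
        }
    }

    /** Explicitly opts out, e.g. a SerDe whose schema lives in an external file. */
    static class ExternalSchemaSerDe extends SketchAbstractSerDe {
        @Override
        public boolean shouldStoreFieldsInMetastore() {
            return false;
        }

        @Override
        public List<String> getFieldNames() {
            return Arrays.asList("external_col");
        }
    }

    // Item 4, read literally: fail before touching the deserializer when the
    // SerDe declares that its fields are not stored in the metastore.
    // IllegalStateException is a placeholder for whatever exception is chosen.
    static List<String> getFieldsFromDeserializer(SketchDeserializer deserializer) {
        if (deserializer instanceof SketchAbstractSerDe
                && !((SketchAbstractSerDe) deserializer).shouldStoreFieldsInMetastore()) {
            throw new IllegalStateException(
                "SerDe declares that its fields are not stored in the metastore");
        }
        // Direct interface implementers (item 3) fall through to the existing path.
        return deserializer.getFieldNames();
    }

    public static void main(String[] args) {
        System.out.println(getFieldsFromDeserializer(new MetastoreBackedSerDe()));
        try {
            getFieldsFromDeserializer(new ExternalSchemaSerDe());
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this sketch a SerDe that implements the interface directly is never rejected by the check, which matches item 3: its jar has to be on the standalone metastore's classpath for the existing path to work.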
[~sershe] and [~alangates] What do you guys think about these suggestions?
> move custom SerDe schema considerations into metastore from QL
> --------------------------------------------------------------
>
> Key: HIVE-17714
> URL: https://issues.apache.org/jira/browse/HIVE-17714
> Project: Hive
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Alan Gates
>
> Columns in metastore for tables that use external schema don't have the type
> information (since HIVE-11985) and may be entirely inconsistent (since
> forever, due to issues like HIVE-17713; or for SerDes that allow an URL for
> the schema, due to a change in the underlying file).
> Currently, if you trace the usage of ConfVars.SERDESUSINGMETASTOREFORSCHEMA,
> and to MetaStoreUtils.getFieldsFromDeserializer, you'd see that the code in
> QL handles this in Hive. So, for the most part metastore just returns
> whatever is stored for columns in the database.
> One exception appears to be get_fields_with_environment_context, which is
> interesting... so getTable will return incorrect columns (potentially), but
> get_fields/get_schema will return correct ones from SerDe as far as I can
> tell.
> As part of separating the metastore, we should make sure all the APIs return
> the correct schema for the columns; it's not a good idea to have everyone
> reimplement getFieldsFromDeserializer.
> Note: this should also remove a flag introduced in HIVE-17731