[
https://issues.apache.org/jira/browse/HIVE-11023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sushanth Sowmyan updated HIVE-11023:
------------------------------------
Attachment: HIVE-11023.patch
> Disable directSQL if datanucleus.identifierFactory = datanucleus2
> -----------------------------------------------------------------
>
> Key: HIVE-11023
> URL: https://issues.apache.org/jira/browse/HIVE-11023
> Project: Hive
> Issue Type: Bug
> Components: Metastore
> Affects Versions: 1.3.0, 1.2.1, 2.0.0
> Reporter: Sushanth Sowmyan
> Assignee: Sushanth Sowmyan
> Priority: Critical
> Attachments: HIVE-11023.patch
>
>
> We hit an interesting bug in a case where datanucleus.identifierFactory =
> datanucleus2 .
> The problem is that directSql handgenerates SQL strings assuming
> "datanucleus1" naming scheme. If a user has their metastore JDO managed by
> datanucleus.identifierFactory = datanucleus2 , the SQL strings we generate
> are incorrect.
> One simple example of what this results in is the following: whenever DN
> persists a field which is held as a List<T>, it winds up storing each T as a
> separate line in the appropriate mapping table, and has a column called
> INTEGER_IDX, which holds the position in the list. Then, upon reading, it
> automatically reads all relevant lines with an ORDER BY INTEGER_IDX, which
> results in the list retaining its order. In DN2 naming scheme, the column is
> called IDX, instead of INTEGER_IDX. If the user has run appropriate metatool
> upgrade scripts, it is highly likely that they have both columns, INTEGER_IDX
> and IDX.
> Whenever they use JDO, such as with all writes, it will then use the IDX
> field, and when they do any sort of optimized reads, such as through
> directSQL, it will ORDER BY INTEGER_IDX.
> An immediate danger is seen when we consider that the schema of a table is
> stored as a List<FieldSchema> , and while IDX has 0,1,2,3,... , INTEGER_IDX
> will contain 0,0,0,0,... and thus, any attempt to describe the table or fetch
> schema for the table can come up mixed up in the table's native hashing
> order, rather than sorted by the index.
> This can then result in schema ordering being different from the actual
> table. For eg:, if a user has a (a:int,b:string,c:string), a describe on this
> may return (c:string, a:int, b: string), and thus, queries which are
> inserting after selecting from another table can have ClassCastExceptions
> when trying to insert data in the wong order - this is how we discovered this
> bug. This problem, however, can be far worse, if there are no type problems -
> it is possible, for eg., that if a,b&c were all strings, that that insert
> query would succeed but mix up the order, which then results in user table
> data being mixed up. This has the potential to be very bad.
> We should write a tool to help convert metastores that use "datanucleus2" to
> "datanucleus1"(more difficult, needs more one-time testing) or change
> directSql to support both(easier to code, but increases test-coverage matrix
> significantly and we should really then be testing against both schemes). But
> in the short term, we should disable directSql if we see that the
> identifierfactory is "datanucleus2"
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)