[
https://issues.apache.org/jira/browse/HIVE-21940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877931#comment-16877931
]
Laszlo Bodor commented on HIVE-21940:
-------------------------------------
Could you please review [~kgyrtkirk]?
The point of the fix is the modification in package.jdo, all others are just
refactors and testing improvements (in order to be able to test with postgres
metastore).
> Metastore: Postgres text <-> clob mismatch for PARTITION_PARAMS/PARAM_VALUE
> ---------------------------------------------------------------------------
>
> Key: HIVE-21940
> URL: https://issues.apache.org/jira/browse/HIVE-21940
> Project: Hive
> Issue Type: Bug
> Affects Versions: 3.2.0
> Reporter: Laszlo Bodor
> Assignee: Laszlo Bodor
> Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-21940.01.patch, HIVE-21940.repro.patch
>
>
> Summary: this is a fix for a regression introduced by HIVE-20833/HIVE-20221,
> fixed in the same way as HIVE-16667 earlier
> This issue was found while running sqoop/hive tests on a cluster with hive
> with postgres metastore, and it turned out the problem is that PARAM_VALUE is
> handled as it was CLOB but it's a text, so extractSqlClob returns it as is.
> It's reproducible on cluster by the following statements:
> {code}
> USE default;
> drop table if exists my_table;
> create external table my_table (col1 int, col3 int) partitioned by (col2
> string) STORED AS TEXTFILE;
> insert into my_table VALUES(11,201,"F");
> SELECT pp.* FROM sys.partition_params pp join sys.partitions p on p.part_id =
> pp.part_id join sys.tbls t on t.tbl_id = p.tbl_id where t.tbl_name =
> "my_table";
> {code}
> sys query results in:
> {code}
> +-------------+------------------------+-----------------+
> | pp.part_id | pp.param_key | pp.param_value |
> +-------------+------------------------+-----------------+
> | 151 | rawDataSize | 28629 |
> | 151 | numRows | 28628 |
> | 151 | transient_lastDdlTime | 28627 |
> | 151 | COLUMN_STATS_ACCURATE | 28626 |
> | 151 | numFiles | 28625 |
> | 151 | totalSize | 28622 |
> +-------------+------------------------+-----------------+
> {code}
> Seems like (propably) since HIVE-20833/HIVE-20221 there is an inconvenience
> while using PARTITION_PARAMS/PARAM_VALUE, because in postgres there is no
> such type as CLOB, and metastore simply saves large object ids into this
> field. More interesting is that the large object can be resolved in some
> codepaths. In case of a describe for partition it works correctly:
> {code}
> describe formatted my_table partition (col2='F');
> ...
> | Partition Parameters: | NULL
> | NULL |
> | | COLUMN_STATS_ACCURATE
> |
> {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col3\":\"true\"}}
> |
> | | numFiles
> | 1 |
> | | numRows
> | 1 |
> | | rawDataSize
> | 6 |
> | | totalSize
> | 7 |
> | | transient_lastDdlTime
> | 1561976024 |
> | | NULL
> | NULL |
> {code}
> But in case of a direct metastore query (from hive's sys schema, but the same
> result for direct postgres), it shows the result above (see sys query
> output). This is an issue when hive treats these ids as they were real
> values, but they are obviously not correct, and this causes various failures
> (e.g. using serde parameter serialization.format=28392)
> param_value values above are large object ids, according to pg_dump
> | 151 | COLUMN_STATS_ACCURATE | 28626 |
> {code}
> SELECT pg_catalog.lo_open('28626', 131072);
> SELECT pg_catalog.lowrite(0,
> '\x7b2242415349435f5354415453223a2274727565222c22434f4c554d4e5f5354415453223a7b22636f6c31223a2274727565222c22636f6c33223a2274727565227d7d');
> SELECT pg_catalog.lo_close(0);
> {code}
> decoded large object value:
> {code}
> {"BASIC_STATS":"true","COLUMN_STATS":{"col1":"true","col3":"true"}}
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)