[ 
https://issues.apache.org/jira/browse/HIVE-21940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Laszlo Bodor updated HIVE-21940:
--------------------------------
    Attachment: HIVE-21940.01.patch

> Metastore: Postgres text <-> clob mismatch for PARTITION_PARAMS/PARAM_VALUE
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-21940
>                 URL: https://issues.apache.org/jira/browse/HIVE-21940
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.2.0
>            Reporter: Laszlo Bodor
>            Assignee: Laszlo Bodor
>            Priority: Major
>             Fix For: 4.0.0
>
>         Attachments: HIVE-21940.01.patch, HIVE-21940.repro.patch
>
>
> This issue was found while running sqoop/hive tests on a cluster with hive 
> with postgres metastore, and it turned out the problem is that PARAM_VALUE is 
> handled as it was CLOB but it's a text, so extractSqlClob returns it as is.
> It's reproducible on cluster by the following statements:
> {code}
> USE default;
> drop table if exists my_table;
> create external table my_table (col1 int, col3 int) partitioned by (col2 
> string) STORED AS TEXTFILE;
> insert into my_table VALUES(11,201,"F");
> SELECT pp.* FROM sys.partition_params pp join sys.partitions p on p.part_id = 
> pp.part_id join sys.tbls t on t.tbl_id = p.tbl_id where t.tbl_name = 
> "my_table";
> {code}
> sys query results in:
> {code}
> +-------------+------------------------+-----------------+
> | pp.part_id  |      pp.param_key      | pp.param_value  |
> +-------------+------------------------+-----------------+
> | 151         | rawDataSize            | 28629           |
> | 151         | numRows                | 28628           |
> | 151         | transient_lastDdlTime  | 28627           |
> | 151         | COLUMN_STATS_ACCURATE  | 28626           |
> | 151         | numFiles               | 28625           |
> | 151         | totalSize              | 28622           |
> +-------------+------------------------+-----------------+
> {code}
> Seems like (propably) since HIVE-20833/HIVE-20221 there is an inconvenience 
> while using PARTITION_PARAMS/PARAM_VALUE, because in postgres there is no 
> such type as CLOB, and metastore simply saves large object ids into this 
> field. More interesting is that the large object can be resolved in some 
> codepaths. In case of a describe for partition it works correctly:
> {code}
> describe formatted my_table partition (col2='F');
> ...
> | Partition Parameters:             | NULL                                    
>            | NULL                                               |
> |                                   | COLUMN_STATS_ACCURATE                   
>            | 
> {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col3\":\"true\"}}
>  |
> |                                   | numFiles                                
>            | 1                                                  |
> |                                   | numRows                                 
>            | 1                                                  |
> |                                   | rawDataSize                             
>            | 6                                                  |
> |                                   | totalSize                               
>            | 7                                                  |
> |                                   | transient_lastDdlTime                   
>            | 1561976024                                         |
> |                                   | NULL                                    
>            | NULL                                               |
> {code}
> But in case of a direct metastore query (from hive's sys schema, but the same 
> result for direct postgres), it shows the result above (see sys query 
> output). This is an issue when hive treats these ids as they were real 
> values, but they are obviously not correct, and this causes various failures 
> (e.g. using serde parameter serialization.format=28392)
> param_value values above are large object ids, according to pg_dump
> | 151         | COLUMN_STATS_ACCURATE  | 28626           |
> {code}
> SELECT pg_catalog.lo_open('28626', 131072);
> SELECT pg_catalog.lowrite(0, 
> '\x7b2242415349435f5354415453223a2274727565222c22434f4c554d4e5f5354415453223a7b22636f6c31223a2274727565222c22636f6c33223a2274727565227d7d');
> SELECT pg_catalog.lo_close(0);
> {code}
> decoded large object value:
> {code}
> {"BASIC_STATS":"true","COLUMN_STATS":{"col1":"true","col3":"true"}}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to