[
https://issues.apache.org/jira/browse/HIVE-28262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Butao Zhang resolved HIVE-28262.
--------------------------------
Resolution: Fixed
Merged into master branch!
Thanks [~laughing_vzr] for the fix!!!
> Single-column table using MultiDelimitSerDe parses the column incorrectly
> ---------------------------------------------------------------------
>
> Key: HIVE-28262
> URL: https://issues.apache.org/jira/browse/HIVE-28262
> Project: Hive
> Issue Type: Bug
> Components: HiveServer2
> Affects Versions: 3.1.3, 4.1.0
> Environment: Hive version: 3.1.3
> Reporter: Liu Weizheng
> Assignee: Liu Weizheng
> Priority: Major
> Labels: HiveServer2, pull-request-available
> Fix For: 4.1.0
>
> Attachments: CleanShot 2024-05-16 at [email protected], CleanShot
> 2024-05-16 at [email protected]
>
>
> ENV:
> Hive: 3.1.3/4.1.0
> HDFS: 3.3.1
> --------------------------
> Create a text file to load into the external table (e.g. /tmp/data):
>
> {code:java}
> 1|@|
> 2|@|
> 3|@| {code}
>
>
> Create external table:
>
> {code:java}
> CREATE EXTERNAL TABLE IF NOT EXISTS test_split_tmp (`ID` string)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
> WITH SERDEPROPERTIES ('field.delim'='|@|')
> STORED AS textfile
> LOCATION '/tmp/test_split_tmp'; {code}
>
> Put the text file into the external table's path:
>
> {code:java}
> hdfs dfs -put /tmp/data /tmp/test_split_tmp {code}
>
>
> Query the table, casting column id to long:
>
> {code:java}
> select UDFToLong(`id`) from test_split_tmp; {code}
> *Why use the UDFToLong function? Because it returns NULL in this condition,
> even though applying it to the string '1' should return the long value 1.*
> {code:java}
> +--------+
> | id |
> +--------+
> | NULL |
> | NULL |
> | NULL |
> +--------+ {code}
> Therefore, I suspect there is an issue with the field splitting in
> MultiDelimitSerDe.
> While debugging, I found the problems below:
> * org.apache.hadoop.hive.serde2.lazy.LazyStruct#findIndexes
> *when fields.length == 1, the delimiter index is never found*
>
> {code:java}
> private int[] findIndexes(byte[] array, byte[] target) {
>   if (fields.length <= 1) { // bug: single-column tables never search for the delimiter
>     return new int[0];
>   }
>   ...
>   for (int i = 1; i < indexes.length; i++) { // bug
>     array = Arrays.copyOfRange(array, indexInNewArray + target.length, array.length);
>     indexInNewArray = Bytes.indexOf(array, target);
>     if (indexInNewArray == -1) {
>       break;
>     }
>     indexes[i] = indexInNewArray + indexes[i - 1] + target.length;
>   }
>   return indexes;
> }{code}
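For reference, the delimiter lookup can be sketched in standalone form. This is an illustrative sketch only, not the actual Hive patch: the class name `FindIndexesSketch`, its `indexOf` helper, and the explicit `numFields` parameter are all assumptions. The point it demonstrates is that with the single-field guard removed, the delimiter `|@|` at offset 1 in the row `1|@|` is still found.

```java
import java.util.Arrays;

public class FindIndexesSketch {
    // Sketch: return the start offset of up to numFields occurrences of
    // `target` in `array`, without skipping the single-field case.
    static int[] findIndexes(byte[] array, byte[] target, int numFields) {
        int[] indexes = new int[numFields];
        Arrays.fill(indexes, -1);
        int from = 0;
        for (int i = 0; i < numFields; i++) {
            int idx = indexOf(array, target, from);
            if (idx == -1) {
                break; // fewer delimiters than fields: remaining entries stay -1
            }
            indexes[i] = idx;
            from = idx + target.length; // continue searching after this delimiter
        }
        return indexes;
    }

    // Naive byte-array substring search starting at `from`.
    static int indexOf(byte[] array, byte[] target, int from) {
        outer:
        for (int i = from; i <= array.length - target.length; i++) {
            for (int j = 0; j < target.length; j++) {
                if (array[i + j] != target[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] row = "1|@|".getBytes();
        byte[] delim = "|@|".getBytes();
        // With a single field, the delimiter at offset 1 is still found.
        System.out.println(Arrays.toString(findIndexes(row, delim, 1))); // [1]
    }
}
```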
>
> * org.apache.hadoop.hive.serde2.lazy.LazyStruct#parseMultiDelimit
> *when fields.length == 1, the column startPosition is never computed*
>
> {code:java}
> public void parseMultiDelimit(byte[] rawRow, byte[] fieldDelimit) {
>   ...
>   int[] delimitIndexes = findIndexes(rawRow, fieldDelimit);
>   ...
>     if (fields.length > 1 && delimitIndexes[i - 1] != -1) { // bug
>       int start = delimitIndexes[i - 1] + fieldDelimit.length;
>       startPosition[i] = start - i * diff;
>     } else {
>       startPosition[i] = length + 1;
>     }
>   }
>   Arrays.fill(fieldInited, false);
>   parsed = true;
> }{code}
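Assuming the serde rewrites each multi-character delimiter to a single byte (so offsets in the rewritten row shift left by `delim.length - 1` per preceding delimiter, which is what `diff` appears to mean above), the intended start-position arithmetic can be sketched standalone. The class and method names here are illustrative, not Hive's:

```java
public class StartPositionSketch {
    // Sketch of the expected start-position math: start[0] is always 0, and
    // each subsequent column starts just after its delimiter, shifted left by
    // i * (delimLen - 1) because each delimiter collapses to one byte.
    static int[] startPositions(int[] delimitIndexes, int delimLen, int numFields) {
        int diff = delimLen - 1;
        int[] start = new int[numFields + 1];
        start[0] = 0;
        for (int i = 1; i <= numFields; i++) {
            if (i - 1 < delimitIndexes.length && delimitIndexes[i - 1] != -1) {
                start[i] = delimitIndexes[i - 1] + delimLen - i * diff;
            } else {
                start[i] = start[i - 1]; // no further delimiter: empty trailing column
            }
        }
        return start;
    }

    public static void main(String[] args) {
        // Row "1|@|" with one column: the delimiter sits at raw offset 1 and
        // is 3 bytes long, so the next column should start at 2, not 1.
        int[] start = startPositions(new int[]{1}, 3, 1);
        System.out.println(java.util.Arrays.toString(start)); // [0, 2]
    }
}
```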
>
>
> Multi-delimiter processing of the row 1|@|:
> *Actual:* 1|@| -> 1^A ; id column starts at 0, next column starts at 1
> *Expected:* 1|@| -> 1^A ; id column starts at 0, next column starts at 2
>
> Fix:
> # when fields.length == 1, still find the multi-character delimiter index
> # when fields.length == 1, calculate the column start position correctly
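Putting both fixes together, the expected end-to-end behavior for the single-column case can be sketched as follows. This is an assumed illustration of the intended result, not the Hive patch itself; `parseSingleColumn` is a hypothetical helper:

```java
public class SingleColumnParseSketch {
    // With both fixes applied, a single-column row "1|@|" should yield the
    // field "1", which then casts cleanly to long instead of producing NULL.
    static String parseSingleColumn(String rawRow, String delim) {
        int idx = rawRow.indexOf(delim);               // fix 1: look for the delimiter even with one field
        int end = (idx == -1) ? rawRow.length() : idx; // fix 2: the column spans [0, delimiter)
        return rawRow.substring(0, end);
    }

    public static void main(String[] args) {
        String field = parseSingleColumn("1|@|", "|@|");
        System.out.println(field);                 // 1
        System.out.println(Long.parseLong(field)); // 1, matching UDFToLong's expected result
    }
}
```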
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)