[
https://issues.apache.org/jira/browse/HIVE-28262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Butao Zhang resolved HIVE-28262.
--------------------------------
Resolution: Fixed
Merged into master branch!
Thanks [~laughing_vzr] for the fix!!!
> Single-column table using MultiDelimitSerDe parses the column incorrectly
> ---------------------------------------------------------------------
>
> Key: HIVE-28262
> URL: https://issues.apache.org/jira/browse/HIVE-28262
> Project: Hive
> Issue Type: Bug
> Components: HiveServer2
> Affects Versions: 3.1.3, 4.1.0
> Environment: Hive version: 3.1.3
> Reporter: Liu Weizheng
> Assignee: Liu Weizheng
> Priority: Major
> Labels: HiveServer2, pull-request-available
> Fix For: 4.1.0
>
> Attachments: CleanShot 2024-05-16 at [email protected], CleanShot
> 2024-05-16 at [email protected]
>
>
> ENV:
> Hive: 3.1.3/4.1.0
> HDFS: 3.3.1
> --------------------------
> Create a text file to load into the external table (e.g. /tmp/data):
>
> {code:java}
> 1|@|
> 2|@|
> 3|@| {code}
>
>
> Create external table:
>
> {code:java}
> CREATE EXTERNAL TABLE IF NOT EXISTS test_split_tmp (`ID` string)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
> WITH SERDEPROPERTIES ('field.delim'='|@|')
> STORED AS textfile
> LOCATION '/tmp/test_split_tmp'; {code}
>
> Put the text file into the external table's path:
>
> {code:java}
> hdfs dfs -put /tmp/data /tmp/test_split_tmp {code}
>
>
> Query the table, casting column id to long:
>
> {code:java}
> select UDFToLong(`id`) from test_split_tmp; {code}
> *Why use the UDFToLong function? Because it returns NULL in this condition,
> even though applying it to the string '1' should return the long value 1.*
> {code:java}
> +--------+
> | id |
> +--------+
> | NULL |
> | NULL |
> | NULL |
> +--------+ {code}
> Therefore, I suspect there is an issue with the field splitting in
> MultiDelimitSerDe.
> While debugging, I found the problems below:
> * org.apache.hadoop.hive.serde2.lazy.LazyStruct#findIndexes
> *when fields.length == 1, the delimiter index is never found*
>
> {code:java}
> private int[] findIndexes(byte[] array, byte[] target) {
>   if (fields.length <= 1) { // bug: single-column tables never search for the delimiter
>     return new int[0];
>   }
>   ...
>   for (int i = 1; i < indexes.length; i++) { // bug
>     array = Arrays.copyOfRange(array, indexInNewArray + target.length, array.length);
>     indexInNewArray = Bytes.indexOf(array, target);
>     if (indexInNewArray == -1) {
>       break;
>     }
>     indexes[i] = indexInNewArray + indexes[i - 1] + target.length;
>   }
>   return indexes;
> }{code}
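For reference, the delimiter lookup can be sketched in standalone form. This is an illustrative sketch only, not the actual Hive patch: the class name `FindIndexesSketch`, its `indexOf` helper, and the explicit `numFields` parameter are all assumptions. The point it demonstrates is that with the single-field guard removed, the delimiter `|@|` at offset 1 in the row `1|@|` is still found.

```java
import java.util.Arrays;

public class FindIndexesSketch {
    // Sketch: return the start offset of up to numFields occurrences of
    // `target` in `array`, without skipping the single-field case.
    static int[] findIndexes(byte[] array, byte[] target, int numFields) {
        int[] indexes = new int[numFields];
        Arrays.fill(indexes, -1);
        int from = 0;
        for (int i = 0; i < numFields; i++) {
            int idx = indexOf(array, target, from);
            if (idx == -1) {
                break; // fewer delimiters than fields: remaining entries stay -1
            }
            indexes[i] = idx;
            from = idx + target.length; // continue searching after this delimiter
        }
        return indexes;
    }

    // Naive byte-array substring search starting at `from`.
    static int indexOf(byte[] array, byte[] target, int from) {
        outer:
        for (int i = from; i <= array.length - target.length; i++) {
            for (int j = 0; j < target.length; j++) {
                if (array[i + j] != target[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] row = "1|@|".getBytes();
        byte[] delim = "|@|".getBytes();
        // With a single field, the delimiter at offset 1 is still found.
        System.out.println(Arrays.toString(findIndexes(row, delim, 1))); // [1]
    }
}
```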
>
> * org.apache.hadoop.hive.serde2.lazy.LazyStruct#parseMultiDelimit
> *when fields.length == 1, the column startPosition is never computed*
>
> {code:java}
> public void parseMultiDelimit(byte[] rawRow, byte[] fieldDelimit) {
>   ...
>   int[] delimitIndexes = findIndexes(rawRow, fieldDelimit);
>   ...
>     if (fields.length > 1 && delimitIndexes[i - 1] != -1) { // bug
>       int start = delimitIndexes[i - 1] + fieldDelimit.length;
>       startPosition[i] = start - i * diff;
>     } else {
>       startPosition[i] = length + 1;
>     }
>   }
>   Arrays.fill(fieldInited, false);
>   parsed = true;
> }{code}
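Assuming the serde rewrites each multi-character delimiter to a single byte (so offsets in the rewritten row shift left by `delim.length - 1` per preceding delimiter, which is what `diff` appears to mean above), the intended start-position arithmetic can be sketched standalone. The class and method names here are illustrative, not Hive's:

```java
public class StartPositionSketch {
    // Sketch of the expected start-position math: start[0] is always 0, and
    // each subsequent column starts just after its delimiter, shifted left by
    // i * (delimLen - 1) because each delimiter collapses to one byte.
    static int[] startPositions(int[] delimitIndexes, int delimLen, int numFields) {
        int diff = delimLen - 1;
        int[] start = new int[numFields + 1];
        start[0] = 0;
        for (int i = 1; i <= numFields; i++) {
            if (i - 1 < delimitIndexes.length && delimitIndexes[i - 1] != -1) {
                start[i] = delimitIndexes[i - 1] + delimLen - i * diff;
            } else {
                start[i] = start[i - 1]; // no further delimiter: empty trailing column
            }
        }
        return start;
    }

    public static void main(String[] args) {
        // Row "1|@|" with one column: the delimiter sits at raw offset 1 and
        // is 3 bytes long, so the next column should start at 2, not 1.
        int[] start = startPositions(new int[]{1}, 3, 1);
        System.out.println(java.util.Arrays.toString(start)); // [0, 2]
    }
}
```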
>
>
> Multi-delimiter processing of the row 1|@|:
> *Actual:* 1|@| -> 1^A ; id column starts at 0, next column starts at 1
> *Expected:* 1|@| -> 1^A ; id column starts at 0, next column starts at 2
>
> Fix:
> # when fields.length == 1, still find the multi-character delimiter index
> # when fields.length == 1, calculate the column start position correctly
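Putting both fixes together, the expected end-to-end behavior for the single-column case can be sketched as follows. This is an assumed illustration of the intended result, not the Hive patch itself; `parseSingleColumn` is a hypothetical helper:

```java
public class SingleColumnParseSketch {
    // With both fixes applied, a single-column row "1|@|" should yield the
    // field "1", which then casts cleanly to long instead of producing NULL.
    static String parseSingleColumn(String rawRow, String delim) {
        int idx = rawRow.indexOf(delim);               // fix 1: look for the delimiter even with one field
        int end = (idx == -1) ? rawRow.length() : idx; // fix 2: the column spans [0, delimiter)
        return rawRow.substring(0, end);
    }

    public static void main(String[] args) {
        String field = parseSingleColumn("1|@|", "|@|");
        System.out.println(field);                 // 1
        System.out.println(Long.parseLong(field)); // 1, matching UDFToLong's expected result
    }
}
```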
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)