Liu Weizheng created HIVE-28262:
-----------------------------------

             Summary: Single column use MultiDelimitSerDe parse column error
                 Key: HIVE-28262
                 URL: https://issues.apache.org/jira/browse/HIVE-28262
             Project: Hive
          Issue Type: Bug
          Components: HiveServer2
    Affects Versions: 3.1.3, 4.1.0
         Environment: Hive version: 3.1.3
            Reporter: Liu Weizheng
            Assignee: Liu Weizheng
             Fix For: 4.1.0
         Attachments: CleanShot 2024-05-16 at 15.13...@2x.png, CleanShot 2024-05-16 at 15.17...@2x.png

ENV:
Hive: 3.1.3/4.1.0
HDFS: 3.3.1

--------------------------

Create a text file to load into the external table (e.g. /tmp/data):
{code}
1|@|
2|@|
3|@|
{code}
Create the external table:
{code:sql}
CREATE EXTERNAL TABLE IF NOT EXISTS test_split_tmp(`ID` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES('field.delim'='|@|')
STORED AS textfile
LOCATION '/tmp/test_split_tmp';
{code}
Put the text file into the external table path:
{code:bash}
hdfs dfs -put /tmp/data /tmp/test_split_tmp
{code}
Query the table and cast the id column to long:
{code:sql}
select UDFToLong(`id`) from test_split_tmp;
{code}
*Why use the UDFToLong function? Because it makes the problem visible: the string '1' should be converted to the long value 1, but in this case the query returns NULL.*
{code}
+--------+
|   id   |
+--------+
| NULL   |
| NULL   |
| NULL   |
+--------+
{code}
Therefore, I suspect there is an issue with field splitting in MultiDelimitSerDe. While debugging, I found the following problems:

* org.apache.hadoop.hive.serde2.lazy.LazyStruct#findIndexes
*When fields.length=1, the delimiter index cannot be found.*
{code:java}
private int[] findIndexes(byte[] array, byte[] target) {
  if (fields.length <= 1) {  // bug
    return new int[0];
  }
  ...
  for (int i = 1; i < indexes.length; i++) {  // bug
    array = Arrays.copyOfRange(array, indexInNewArray + target.length, array.length);
    indexInNewArray = Bytes.indexOf(array, target);
    if (indexInNewArray == -1) {
      break;
    }
    indexes[i] = indexInNewArray + indexes[i - 1] + target.length;
  }
  return indexes;
}
{code}
* org.apache.hadoop.hive.serde2.lazy.LazyStruct#parseMultiDelimit
*When fields.length=1, the column start position cannot be found.*
{code:java}
public void parseMultiDelimit(byte[] rawRow, byte[] fieldDelimit) {
  ...
  int[] delimitIndexes = findIndexes(rawRow, fieldDelimit);
  ...
    if (fields.length > 1 && delimitIndexes[i - 1] != -1) {  // bug
      int start = delimitIndexes[i - 1] + fieldDelimit.length;
      startPosition[i] = start - i * diff;
    } else {
      startPosition[i] = length + 1;
    }
  }
  Arrays.fill(fieldInited, false);
  parsed = true;
}
{code}
Multi-delimiter processing:

*Actual:* 1|@| -> 1^A: id column starts at 0, next column starts at 1

*Expected:* 1|@| -> 1^A: id column starts at 0, next column starts at 2

Fix:
# When fields.length=1, findIndexes should still find the multi-character delimiter index.
# When fields.length=1, parseMultiDelimit should calculate the column start position correctly.
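
To illustrate the expected start positions, here is a minimal standalone sketch. It is not the Hive code and not the actual patch. The class MultiDelimitSketch and its methods findDelimiterIndexes and startPositions are hypothetical names made up for this example. It assumes that each multi-character delimiter is collapsed to a single separator byte before parsing (as in 1|@| -> 1^A above) and that field i ends one byte before the start position of field i + 1. The only point is that the delimiter is searched for even when there is a single column.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Hive.
class MultiDelimitSketch {

    // Offsets of every non-overlapping occurrence of delim in row.
    // Unlike the snippet quoted above, there is no early return for a single field.
    static List<Integer> findDelimiterIndexes(byte[] row, byte[] delim) {
        List<Integer> indexes = new ArrayList<>();
        if (delim.length == 0) {
            return indexes;
        }
        int pos = 0;
        while (pos + delim.length <= row.length) {
            if (Arrays.equals(Arrays.copyOfRange(row, pos, pos + delim.length), delim)) {
                indexes.add(pos);
                pos += delim.length;
            } else {
                pos++;
            }
        }
        return indexes;
    }

    // Start positions in the converted row, where each multi-character
    // delimiter has been collapsed to one byte (assumption stated above).
    // The result has fieldCount + 1 entries; the last one closes the last field.
    static int[] startPositions(byte[] row, byte[] delim, int fieldCount) {
        List<Integer> indexes = findDelimiterIndexes(row, delim);
        int diff = delim.length - 1;  // bytes saved per collapsed delimiter
        int convertedLength = row.length - indexes.size() * diff;
        int[] start = new int[fieldCount + 1];  // start[0] is always 0
        for (int i = 1; i <= fieldCount; i++) {
            if (i - 1 < indexes.size()) {
                // Same formula as in the quoted parseMultiDelimit snippet.
                start[i] = indexes.get(i - 1) + delim.length - i * diff;
            } else {
                // No delimiter left: the field runs to the end of the row.
                start[i] = convertedLength + 1;
            }
        }
        return start;
    }

    public static void main(String[] args) {
        byte[] delim = "|@|".getBytes(StandardCharsets.UTF_8);
        byte[] row = "1|@|".getBytes(StandardCharsets.UTF_8);
        // Prints [0, 2]: id column starts at 0, next column starts at 2.
        System.out.println(Arrays.toString(startPositions(row, delim, 1)));
    }
}
```

With these positions the id field covers only the byte '1', so the trailing separator is no longer part of the value passed to UDFToLong.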