Liu Weizheng created HIVE-28262:
-----------------------------------
Summary: Single column use MultiDelimitSerDe parse column error
Key: HIVE-28262
URL: https://issues.apache.org/jira/browse/HIVE-28262
Project: Hive
Issue Type: Bug
Components: HiveServer2
Affects Versions: 3.1.3, 4.1.0
Environment: Hive version: 3.1.3
Reporter: Liu Weizheng
Assignee: Liu Weizheng
Fix For: 4.1.0
Attachments: CleanShot 2024-05-16 at [email protected], CleanShot
2024-05-16 at [email protected]
ENV:
Hive: 3.1.3/4.1.0
HDFS: 3.3.1
--------------------------
Create a text file to load into the external table (e.g. /tmp/data):
{code:java}
1|@|
2|@|
3|@| {code}
Create external table:
{code:java}
CREATE EXTERNAL TABLE IF NOT EXISTS test_split_tmp(`ID` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES('field.delim'='|@|')
STORED AS textfile
LOCATION '/tmp/test_split_tmp'; {code}
Put the text file into the external table's path:
{code:java}
hdfs dfs -put /tmp/data /tmp/test_split_tmp {code}
Query the table and cast column id to long:
{code:java}
select UDFToLong(`id`) from test_split_tmp; {code}
*Why use the UDFToLong function? Because it returns NULL in this case,
whereas applying it to the string '1' should return the long value 1.*
{code:java}
+--------+
| id     |
+--------+
| NULL   |
| NULL   |
| NULL   |
+--------+ {code}
Therefore, I suspect there is an issue with field splitting in
MultiDelimitSerDe.
While debugging, I found the following problems:
* org.apache.hadoop.hive.serde2.lazy.LazyStruct#findIndexes
*When fields.length=1, the delimiter index is never found.*
{code:java}
private int[] findIndexes(byte[] array, byte[] target) {
  if (fields.length <= 1) { // bug: a single-column table returns here, so the delimiter is never located
    return new int[0];
  }
  ...
  for (int i = 1; i < indexes.length; i++) { // bug
    array = Arrays.copyOfRange(array, indexInNewArray + target.length,
        array.length);
    indexInNewArray = Bytes.indexOf(array, target);
    if (indexInNewArray == -1) {
      break;
    }
    indexes[i] = indexInNewArray + indexes[i - 1] + target.length;
  }
  return indexes;
}{code}
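To illustrate the first point, here is a minimal, self-contained sketch of a delimiter search that does not depend on the number of declared columns. It is not the Hive code or the committed fix: the class DelimiterIndexSketch and its methods are hypothetical names used only for this example, and it uses a naive byte search instead of Bytes.indexOf.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Hive.
final class DelimiterIndexSketch {

  // Returns the offset of every non-overlapping occurrence of delim in row,
  // regardless of how many columns the table declares.
  static int[] findDelimiterIndexes(byte[] row, byte[] delim) {
    if (delim.length == 0) {
      return new int[0];
    }
    List<Integer> found = new ArrayList<>();
    int from = 0;
    while (from + delim.length <= row.length) {
      int hit = indexOf(row, delim, from);
      if (hit == -1) {
        break;
      }
      found.add(hit);
      from = hit + delim.length; // continue searching after this delimiter
    }
    int[] result = new int[found.size()];
    for (int i = 0; i < result.length; i++) {
      result[i] = found.get(i);
    }
    return result;
  }

  // Naive byte-array search starting at offset from; returns -1 if absent.
  static int indexOf(byte[] haystack, byte[] needle, int from) {
    outer:
    for (int i = from; i + needle.length <= haystack.length; i++) {
      for (int j = 0; j < needle.length; j++) {
        if (haystack[i + j] != needle[j]) {
          continue outer;
        }
      }
      return i;
    }
    return -1;
  }

  public static void main(String[] args) {
    byte[] delim = "|@|".getBytes(StandardCharsets.UTF_8);
    byte[] row = "1|@|".getBytes(StandardCharsets.UTF_8);
    // The single-column row from this report still yields one delimiter offset.
    System.out.println(Arrays.toString(findDelimiterIndexes(row, delim)));
  }
}
```

For the row 1|@| from the reproduction, this sketch reports the delimiter at offset 1 even though the table has only one column, which is the information the early return above discards.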
* org.apache.hadoop.hive.serde2.lazy.LazyStruct#parseMultiDelimit
*When fields.length=1, the column's startPosition cannot be determined.*
{code:java}
public void parseMultiDelimit(byte[] rawRow, byte[] fieldDelimit) {
  ...
  int[] delimitIndexes = findIndexes(rawRow, fieldDelimit);
  ...
    if (fields.length > 1 && delimitIndexes[i - 1] != -1) { // bug: never taken when fields.length == 1
      int start = delimitIndexes[i - 1] + fieldDelimit.length;
      startPosition[i] = start - i * diff;
    } else {
      startPosition[i] = length + 1;
    }
  }
  Arrays.fill(fieldInited, false);
  parsed = true;
}{code}
Multi-delimiter processing:
*Actual:* 1|@| -> 1^A, id column starts at 0, next column starts at 1
*Expected:* 1|@| -> 1^A, id column starts at 0, next column starts at 2
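The effect of the two start positions can be sketched as follows. This is an illustration under a stated assumption, not the Hive code: it assumes a field's length is derived as startPosition[i + 1] - startPosition[i] - 1, and the class StartPositionSketch and its method are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Hive.
final class StartPositionSketch {

  // Extracts field i from a row whose multi-character delimiters have already
  // been collapsed to a single separator byte.
  // Assumption: field length = startPosition[i + 1] - startPosition[i] - 1.
  static String field(byte[] row, int[] startPosition, int i) {
    int length = startPosition[i + 1] - startPosition[i] - 1;
    return new String(row, startPosition[i], length, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] row = new byte[] {'1', 1}; // "1^A": the value followed by the separator byte

    // Actual positions from the report: next column starts at 1, so the field is empty.
    System.out.println("[" + field(row, new int[] {0, 1}, 0) + "]");

    // Expected positions: next column starts at 2, so the field is "1".
    System.out.println("[" + field(row, new int[] {0, 2}, 0) + "]");
  }
}
```

Under that assumption, the actual positions produce an empty string for the id column, which would be consistent with the cast to long returning NULL, while the expected positions produce the string 1.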
Fix:
# When fields.length=1, the multi-character delimiter index should still be located.
# When fields.length=1, the column start position should be calculated correctly.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)