Taraka Rama Rao Lethavadla created HIVE-27498:
-------------------------------------------------
Summary: Support custom delimiter in SkippingTextInputFormat
Key: HIVE-27498
URL: https://issues.apache.org/jira/browse/HIVE-27498
Project: Hive
Issue Type: Bug
Components: Hive
Reporter: Taraka Rama Rao Lethavadla
A simple select returns the expected results when the table has these properties
{noformat}
'skip.header.line.count'='1',
'textinputformat.record.delimiter'='|'{noformat}
but select count(*), or any other query that launches a Tez job, treats the
whole text as a single line.
*Test case*
data.csv
{noformat}
Code Name|A AAAA|B BBBB
CCCC|C DDDD{noformat}
DDL
{noformat}
create external table test(code string,name string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='\t')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location '${system:test.tmp.dir}/test'
TBLPROPERTIES (
'skip.header.line.count'='1',
'textinputformat.record.delimiter'='|');{noformat}
Query result
select code,name from test;
{noformat}
A AAAA
B BBBB
CCCC
C DDDD{noformat}
*Problem:* The query _+select count(*) from test+_ returns 1 instead of 3.
It used to work in older Hive versions.
The behaviour changed with the introduction of the feature
https://issues.apache.org/jira/browse/HIVE-21924
That feature aims to split text files while reading them even when the
table is configured to skip headers, thereby increasing the number of
mappers that process the query and improving its throughput.
The actual problem lies in how the new feature reads a file. It does not
consider the 'textinputformat.record.delimiter' property and reads the file
looking for newline characters instead. Since the input file does not have a
newline after every record, the whole file is read as a single line and the
count is returned as 1.
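The mismatch can be reproduced outside Hive with a minimal Python sketch. This is only a simulation of the record counting, not the Hive code path: read_records is a hypothetical helper, and the sample text is copied as rendered in this report, assuming no trailing newline.

```python
# Sample data as rendered in the report (assumption: no trailing newline).
DATA = "Code Name|A AAAA|B BBBB\nCCCC|C DDDD"


def read_records(text, delimiter="\n", skip_header=0):
    """Split text into records on `delimiter`, then drop the first
    `skip_header` records (hypothetical helper, not a Hive API)."""
    records = text.split(delimiter)
    # A trailing delimiter does not start a new record.
    if records and records[-1] == "":
        records.pop()
    return records[skip_header:]


# Honouring 'textinputformat.record.delimiter'='|' with one header record.
expected = read_records(DATA, delimiter="|", skip_header=1)

# Ignoring the custom delimiter and splitting on newlines instead.
observed = read_records(DATA, delimiter="\n", skip_header=1)

print(len(expected))  # 3
print(len(observed))  # 1
```

With the custom delimiter, the second record spans the embedded newline ("B BBBB", newline, "CCCC"), which is why the simple select prints four lines for three records.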
Ref:
[https://github.com/apache/hive/blob/24a82a65f96b65eeebe4e23b2fec425037a70216/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L548]
*Workaround*
If we remove the headers from the data and the skip header config from the
table properties, or compress the files, then we will not get into this issue.
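As a rough illustration of the first option, the header record could be stripped from the data before it is loaded, so that 'skip.header.line.count' is no longer needed. This is a sketch only: strip_header_records is a hypothetical helper, and it assumes the file is small enough to process as a single string.

```python
def strip_header_records(text, delimiter="|", count=1):
    """Return `text` without its first `count` records, where records are
    separated by `delimiter` (hypothetical helper, not part of Hive)."""
    parts = text.split(delimiter, count)
    # Fewer than `count` delimiters means the text is all header.
    return parts[count] if len(parts) > count else ""


# Sample data as rendered in the report (assumption: no trailing newline).
data = "Code Name|A AAAA|B BBBB\nCCCC|C DDDD"
cleaned = strip_header_records(data, delimiter="|", count=1)
print(repr(cleaned))
```

The cleaned text still uses '|' as the record delimiter, so the table would keep 'textinputformat.record.delimiter' and drop only the header property.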