Taraka Rama Rao Lethavadla created HIVE-27498:
-------------------------------------------------

             Summary: Support custom delimiter in SkippingTextInputFormat
                 Key: HIVE-27498
                 URL: https://issues.apache.org/jira/browse/HIVE-27498
             Project: Hive
          Issue Type: Bug
          Components: Hive
            Reporter: Taraka Rama Rao Lethavadla


A simple select returns the expected results when the table has these properties set:
{noformat}
'skip.header.line.count'='1',                    
'textinputformat.record.delimiter'='|'{noformat}
but select count(*), or any other query that launches a Tez job, treats the 
whole text as a single line.

*Test case*

data.csv
{noformat}
Code    Name|A AAAA|B BBBB
CCCC|C  DDDD{noformat}
DDL
{noformat}
create external table test(code string,name string)
ROW FORMAT SERDE
   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
 WITH SERDEPROPERTIES (
   'field.delim'='\t')
 STORED AS INPUTFORMAT
   'org.apache.hadoop.mapred.TextInputFormat'
 OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
   location '${system:test.tmp.dir}/test'
 TBLPROPERTIES (
   'skip.header.line.count'='1',
   'textinputformat.record.delimiter'='|');{noformat}
Query result
select code,name from test;
{noformat}
A AAAA
B BBBB
CCCC
C DDDD{noformat}
*Problem:* The query _+select count(*) from test+_ returns 1 instead of 3.

This used to work in older Hive versions.

The behaviour changed after the introduction of the feature in 
https://issues.apache.org/jira/browse/HIVE-21924

That feature allows text files to be split while reading even when the table 
is configured to skip headers. This increases the number of mappers that 
process the query and thereby improves the query's throughput.

The actual problem lies in how the new feature reads a file. It does not 
consider the 'textinputformat.record.delimiter' property and reads the file 
looking for newline characters. Since the input file does not have a newline 
after every record, the whole file is read as a single line and the count is 
returned as 1.
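
The effect can be illustrated outside Hive with a small standalone sketch. This is not Hive's reader code: the class and method names are hypothetical, and the string literal assumes that data.csv uses tabs between fields and has no trailing newline. It only contrasts counting records by the configured delimiter with counting them by newline.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordDelimiterSketch {

    /** Splits text into records on the given delimiter; a trailing empty record is dropped. */
    static List<String> splitRecords(String text, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = text.indexOf(delimiter, start)) >= 0) {
            records.add(text.substring(start, idx));
            start = idx + delimiter.length();
        }
        if (start < text.length()) {
            records.add(text.substring(start));
        }
        return records;
    }

    /** Counts the records that remain after skipping headerCount leading records. */
    static int countAfterHeader(String text, String delimiter, int headerCount) {
        return Math.max(0, splitRecords(text, delimiter).size() - headerCount);
    }

    public static void main(String[] args) {
        // Assumed contents of data.csv: tab-separated fields, '|' between records.
        String data = "Code\tName|A\tAAAA|B\tBBBB\nCCCC|C\tDDDD";

        // Honouring 'textinputformat.record.delimiter'='|': 4 records, 1 header skipped.
        System.out.println(countAfterHeader(data, "|", 1));   // prints 3

        // Ignoring the property and splitting on newline: 2 lines, 1 header skipped.
        System.out.println(countAfterHeader(data, "\n", 1));  // prints 1
    }
}
```

With the sample data, splitting on '|' gives the expected count of 3, while splitting on newline gives the reported count of 1.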

Ref: 
[https://github.com/apache/hive/blob/24a82a65f96b65eeebe4e23b2fec425037a70216/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L548]

 

 *Workaround*

If we remove the headers from the data and the skip header config from the 
table properties, or compress the files, then we will not get into this issue.
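
For the first option, the header record has to be removed from the data before Hive reads it. A minimal sketch of that preprocessing step follows; the class and method names are hypothetical, it works on an in-memory string rather than on a file, and it assumes the same tab-separated sample data as in the test case.

```java
public class HeaderStripSketch {

    /** Returns text without its first headerCount records; records end at the delimiter. */
    static String stripHeader(String text, String delimiter, int headerCount) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        int start = 0;
        for (int i = 0; i < headerCount; i++) {
            int idx = text.indexOf(delimiter, start);
            if (idx < 0) {
                return ""; // the text holds no more than headerCount records
            }
            start = idx + delimiter.length();
        }
        return text.substring(start);
    }

    public static void main(String[] args) {
        // Assumed contents of data.csv: tab-separated fields, '|' between records.
        String data = "Code\tName|A\tAAAA|B\tBBBB\nCCCC|C\tDDDD";
        // Drops the "Code<TAB>Name" header record and keeps the three data records.
        System.out.println(stripHeader(data, "|", 1));
    }
}
```

Once the data is rewritten this way, 'skip.header.line.count' has to be removed from the table properties as well; otherwise the first real record would be skipped.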

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
