Arina Ielchiieva created DRILL-5941:
---------------------------------------

             Summary: Skip header / footer logic works incorrectly for Hive 
tables when file has several input splits
                 Key: DRILL-5941
                 URL: https://issues.apache.org/jira/browse/DRILL-5941
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Hive
    Affects Versions: 1.11.0
            Reporter: Arina Ielchiieva
            Assignee: Arina Ielchiieva
             Fix For: 1.12.0


*To reproduce*
1. Create a CSV file with two columns (key, value) and 3000029 rows, where the 
first row is a header.
The data file should be larger than the chunk size of 256 MB. Copy the 
file to the distributed file system.

2. Create a table in Hive:
{noformat}
CREATE EXTERNAL TABLE `h_table`(
  `key` bigint,
  `value` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'maprfs:/tmp/h_table'
TBLPROPERTIES (
 'skip.header.line.count'='1');
{noformat}

3. Execute the query {{select * from hive.h_table}} in Drill (querying the data 
through the Hive plugin). The result contains fewer rows than expected. The 
expected result is 3000028 rows (the total count minus one header row).
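
A data file for step 1 could be generated with a small script along these lines. This is only a sketch: the function name {{write_csv}}, the padded value format and the row width are assumptions chosen for illustration, not the data used in the original report.

```python
import os


def write_csv(path, total_rows, value_width):
    """Write total_rows lines: one header line plus total_rows - 1 data rows.

    Hypothetical helper; value_width pads each value so the file size
    is predictable. Returns the size of the written file in bytes.
    """
    with open(path, "w", newline="\n") as out:
        out.write("key,value\n")
        for key in range(1, total_rows):
            # Each data row is "<key>,<padding>\n".
            out.write(f"{key},{'v' * value_width}\n")
    return os.path.getsize(path)
```

Calling {{write_csv("h_table.csv", 3000029, 92)}} gives rows of roughly 100 bytes each, so the file comes to about 300 MB, which is above the 256 MB chunk size.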

*The root cause*
Since the file is larger than the default chunk size, it is split into several 
fragments, known as input splits. For example:
{noformat}
maprfs:/tmp/h_table/h_table.csv:0+268435456
maprfs:/tmp/h_table/h_table.csv:268435457+492782112
{noformat}

TextHiveReader is responsible for handling the skip header and / or footer logic.
Currently Drill creates a reader 
[for each input split|https://github.com/apache/drill/blob/master/contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveScanBatchCreator.java#L84],
 so the skip header and / or footer logic is applied to every input split. 
Ideally, the input splits shown above should be read by a single reader, so 
that the skip header / footer logic is applied correctly.
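
The effect can be illustrated with a toy model. This is not Drill code: the functions {{read_per_split}} and {{read_per_file}} are hypothetical, and splits are modelled as plain lists of lines rather than byte ranges.

```python
def read_per_split(splits, skip_header):
    """Model of the reported behaviour: each split drops its first lines."""
    rows = []
    for split in splits:
        rows.extend(split[skip_header:])
    return rows


def read_per_file(splits, skip_header):
    """Model of the intended behaviour: header lines are dropped once per file."""
    lines = [line for split in splits for line in split]
    return lines[skip_header:]


# One header line followed by nine data rows, cut into two splits.
lines = ["key,value"] + [f"{i},v{i}" for i in range(1, 10)]
splits = [lines[:5], lines[5:]]

per_split = read_per_split(splits, 1)
per_file = read_per_file(splits, 1)
print(len(per_split), len(per_file))
```

In this model the per-split variant also drops the first data row of the second split, so it returns one row fewer for every additional split.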





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
