[
https://issues.apache.org/jira/browse/DRILL-5941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251801#comment-16251801
]
ASF GitHub Bot commented on DRILL-5941:
---------------------------------------
Github user arina-ielchiieva commented on the issue:
https://github.com/apache/drill/pull/1030
@ppadma
To create a reader for each input split and maintain the skip header / footer
functionality, we need to know how many rows are in each input split.
Unfortunately, an input split does not hold such information, only the number
of bytes [1]. We also can't apply the skip header logic only to the first
input split and the skip footer logic only to the last one, since we don't
know in advance how many rows will be skipped: it may happen that we need to
skip the whole first input split and part of the second.
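A minimal sketch of that last point (plain Python, not Drill code; the split
layout and row names are hypothetical): with a footer that is larger than the
last split, the rows to drop cross the split boundary, so no per-split reader
can decide on its own which rows to skip.

```python
# Hypothetical split layout: each reader sees only its own split (and only
# its byte range, not a row count). Rows shown here for illustration only.
splits = [["header", "r1", "r2", "r3"], ["r4", "r5"]]
footer_lines = 3  # skip.footer.line.count

# Dropping the footer needs a global view: the whole last split plus one
# row from the first split must go, which no single per-split reader can
# determine from its own split alone.
all_rows = [row for split in splits for row in split]
kept = all_rows[:-footer_lines]
print(kept)  # the dropped footer rows span both splits
```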
@paul-rogers
To read from Hive we actually use the Hadoop reader [2, 3], so if I am not
mistaken, the approach described above unfortunately cannot be applied.
[1]
https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/FileSplit.html
[2]
https://github.com/apache/drill/blob/master/contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveAbstractReader.java#L234
[3]
https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/RecordReader.html
> Skip header / footer logic works incorrectly for Hive tables when file has
> several input splits
> -----------------------------------------------------------------------------------------------
>
> Key: DRILL-5941
> URL: https://issues.apache.org/jira/browse/DRILL-5941
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Hive
> Affects Versions: 1.11.0
> Reporter: Arina Ielchiieva
> Assignee: Arina Ielchiieva
> Fix For: Future
>
>
> *To reproduce*
> 1. Create a csv file with two columns (key, value) and 3000029 rows, where
> the first row is a header.
> The data file size should be greater than the chunk size of 256 MB. Copy the
> file to the distributed file system.
> 2. Create table in Hive:
> {noformat}
> CREATE EXTERNAL TABLE `h_table`(
> `key` bigint,
> `value` string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS INPUTFORMAT
> 'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
> 'maprfs:/tmp/h_table'
> TBLPROPERTIES (
> 'skip.header.line.count'='1');
> {noformat}
> 3. Execute the query {{select * from hive.h_table}} in Drill (query the data
> using the Hive plugin). The result will return fewer rows than expected. The
> expected result is 3000028 (the total count minus one header row).
> *The root cause*
> Since the file is greater than the default chunk size, it is split into
> several fragments, known as input splits. For example:
> {noformat}
> maprfs:/tmp/h_table/h_table.csv:0+268435456
> maprfs:/tmp/h_table/h_table.csv:268435457+492782112
> {noformat}
> TextHiveReader is responsible for handling the skip header and / or footer
> logic.
> Currently Drill creates a reader [for each input
> split|https://github.com/apache/drill/blob/master/contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveScanBatchCreator.java#L84]
> and the skip header and / or footer logic is applied to each input split
> separately, though ideally the above-mentioned input splits should be read
> by one reader, so that the skip header / footer logic is applied correctly.
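The root cause can be demonstrated with a small simulation (plain Python, not
Drill code; the data and split sizes are illustrative): applying the skip
header logic per split drops one real data row for every split beyond the
first.

```python
# Simulate Hadoop-style byte-based input splits over a small CSV, then
# compare per-split header skipping (the buggy behavior) with file-level
# header skipping (the correct behavior).
rows = ["key,value"] + [f"{i},v{i}" for i in range(10)]
data = "\n".join(rows) + "\n"

def read_lines(blob):
    return blob.strip("\n").split("\n") if blob else []

# Two byte-based splits; align the boundary to a newline so each split
# contains whole lines, as record readers effectively arrange.
mid = data.index("\n", len(data) // 2) + 1
splits = [data[:mid], data[mid:]]

# Buggy: each reader skips the first line of its own split as a "header".
per_split = [line for s in splits for line in read_lines(s)[1:]]

# Correct: skip the header once, for the whole file.
whole_file = read_lines(data)[1:]

print(len(whole_file), len(per_split))  # one data row is lost to the bug
```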
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)