[ https://issues.apache.org/jira/browse/DRILL-5941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259349#comment-16259349 ]

ASF GitHub Bot commented on DRILL-5941:
---------------------------------------

Github user arina-ielchiieva commented on the issue:

    https://github.com/apache/drill/pull/1030
  
    @ppadma the performance impact will affect only tables that use the skip 
header / footer functionality (one reader per file); for other tables 
processing stays the same (one reader per input split). Currently, for tables 
that use skip header / footer and have multiple input splits, these rows are 
skipped for each input split, which leaves the user with an incorrect data 
set (see the sketch below).
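
    To make the data loss concrete, here is a minimal, self-contained 
simulation in plain Java (the split contents are made up for illustration) of 
what happens today when one file with a header is read through several input 
splits and each per-split reader skips a line:
    
    {noformat}
    import java.util.Arrays;
    import java.util.List;
    
    public class SkipHeaderPerSplitBug {
      public static void main(String[] args) {
        // One file: 1 header line + 4 data lines, cut into two "input splits".
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("key,value", "1,a", "2,b"),  // split 1 holds the header
            Arrays.asList("3,c", "4,d"));              // split 2 does not
        int rows = 0;
        for (List<String> split : splits) {
          // skip.header.line.count=1 applied per split (current behaviour):
          rows += split.size() - 1;                    // drops real row "3,c" on split 2
        }
        System.out.println(rows);                      // prints 3, expected 4
      }
    }
    {noformat}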
    
    Regarding your suggestion, as I mentioned in my previous comment (please 
see the links to the code there as well), we use the Hadoop reader for the 
data and don't have information about the number of rows in an input split (I 
wish we did, it would make life much easier).
    
    I have an idea of how to parallelize readers when we have only a header 
(though all readers will still be on the same node), but when we have a 
footer we'll have to read one file per reader (see the sketch below). Also, 
we could consider using the Drill text reader instead of the Hadoop one (as 
we do for Parquet files).
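
    A minimal sketch of the header-only idea (a Java fragment against 
Hadoop's mapred API; headerLineCount and createReader are hypothetical names, 
not Drill APIs): only the split that starts at file offset 0 can contain the 
header, so only its reader skips lines. A footer offers no such shortcut, 
since its position is unknown until the file has been read to the end, hence 
one reader per file in that case:
    
    {noformat}
    // assumes: import org.apache.hadoop.mapred.FileSplit;
    //          import org.apache.hadoop.mapred.InputSplit;
    for (InputSplit split : splits) {
      FileSplit fs = (FileSplit) split;
      // Only the split at the start of the file can contain the header.
      int linesToSkip = (fs.getStart() == 0) ? headerLineCount : 0;
      readers.add(createReader(fs, linesToSkip)); // hypothetical factory
    }
    {noformat}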
    
    Anyway, I suggest we commit these changes and create a new Jira for future 
improvement of skip header / footer performance. Will this be acceptable?


> Skip header / footer logic works incorrectly for Hive tables when file has 
> several input splits
> -----------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5941
>                 URL: https://issues.apache.org/jira/browse/DRILL-5941
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Hive
>    Affects Versions: 1.11.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>             Fix For: Future
>
>
> *To reproduce*
> 1. Create a CSV file with two columns (key, value) and 3000029 rows, where 
> the first row is a header. The data file size should be greater than the 
> chunk size of 256 MB (a generation sketch follows these steps). Copy the 
> file to the distributed file system.
> 2. Create table in Hive:
> {noformat}
> CREATE EXTERNAL TABLE `h_table`(
>   `key` bigint,
>   `value` string)
> ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY ','
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>   'maprfs:/tmp/h_table'
> TBLPROPERTIES (
>  'skip.header.line.count'='1');
> {noformat}
> 3. Execute query {{select * from hive.h_table}} in Drill (query data using 
> the Hive plugin). The query will return fewer rows than expected. The 
> expected result is 3000028 (the total count minus one header row).
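> For reference, a minimal runnable sketch (plain Java, requires Java 11+; 
> the file name and value padding are arbitrary choices) for generating the 
> test file from step 1: a header row plus 3000028 data rows, padded so the 
> file exceeds 256 MB:
> {noformat}
> import java.io.BufferedWriter;
> import java.io.IOException;
> import java.nio.file.Files;
> import java.nio.file.Paths;
> 
> public class GenerateTestCsv {
>   public static void main(String[] args) throws IOException {
>     String pad = "x".repeat(90);  // ~100 bytes/row * 3000028 rows > 256 MB
>     try (BufferedWriter out = Files.newBufferedWriter(Paths.get("h_table.csv"))) {
>       out.write("key,value\n");                // header row
>       for (long i = 1; i <= 3000028; i++) {    // 3000028 data rows
>         out.write(i + "," + pad + i + "\n");
>       }
>     }
>   }
> }
> {noformat}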
> *The root cause*
> Since the file is greater than the default chunk size, it's split into 
> several fragments, known as input splits. For example:
> {noformat}
> maprfs:/tmp/h_table/h_table.csv:0+268435456
> maprfs:/tmp/h_table/h_table.csv:268435457+492782112
> {noformat}
> TextHiveReader is responsible for handling skip header and / or footer logic.
> Currently Drill creates a reader [for each input 
> split|https://github.com/apache/drill/blob/master/contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveScanBatchCreator.java#L84]
> and the skip header and / or footer logic is applied for each input split, 
> though ideally the above-mentioned input splits should be read by one 
> reader, so that the skip header / footer logic is applied correctly (see 
> the sketch below).
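> A minimal sketch (a Java fragment against Hadoop's mapred API; splitsByFile 
> and inputSplits are hypothetical names, not Drill APIs) of the intended 
> direction: group the input splits by source file so that one reader can 
> consume all splits of a file in order and apply the skip header / footer 
> logic exactly once:
> {noformat}
> // assumes: import java.util.*;
> //          import org.apache.hadoop.mapred.FileSplit;
> //          import org.apache.hadoop.mapred.InputSplit;
> Map<String, List<InputSplit>> splitsByFile = new HashMap<>();
> for (InputSplit split : inputSplits) {
>   FileSplit fileSplit = (FileSplit) split;   // text inputs are file splits
>   splitsByFile
>       .computeIfAbsent(fileSplit.getPath().toString(), k -> new ArrayList<>())
>       .add(split);
> }
> // One reader per file: it reads the file's splits in order, skips the
> // header once at the start and the footer once at the end.
> {noformat}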



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
