[jira] [Work started] (HIVE-25765) skip.header.line.count property skips rows of each block in FetchOperator when file size is larger

Ganesha Shreedhara (Jira) Thu, 02 Dec 2021 08:52:17 -0800


     [ https://issues.apache.org/jira/browse/HIVE-25765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Work on HIVE-25765 started by Ganesha Shreedhara.
-------------------------------------------------
> skip.header.line.count property skips rows of each block in FetchOperator 
> when file size is larger
> --------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-25765
>                 URL: https://issues.apache.org/jira/browse/HIVE-25765
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.1.2
>            Reporter: Ganesha Shreedhara
>            Assignee: Ganesha Shreedhara
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: data.txt.gz
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When the _skip.header.line.count_ property is set in the table properties,
> simple select queries that get converted into a FetchTask skip rows from
> each block instead of skipping only the header lines of each file. This
> happens when the file is large enough to be read in multiple blocks. The
> issue does not occur when the select query is converted into a map-only job
> by setting _hive.fetch.task.conversion_ to _none_, because the header lines
> are then skipped only for the first block, as a result of
> [this check|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HiveContextAwareRecordReader.java#L330].
> We should have a similar check in FetchOperator to avoid this issue.
>  
> *Steps to reproduce:* 
> {code:sql}
> -- Create a table on top of the data file (uncompressed size: ~239M)
> -- attached to this ticket
> CREATE EXTERNAL TABLE test_table(
>   col1 string,
>   col2 string,
>   col3 string,
>   col4 string,
>   col5 string,
>   col6 string,
>   col7 string,
>   col8 string,
>   col9 string,
>   col10 string,
>   col11 string,
>   col12 string)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>   'location_of_data_file'
> TBLPROPERTIES ('skip.header.line.count'='1');
> -- Counting the rows gives the correct result, with only one header line
> -- skipped
> select count(*) from test_table;
> -- 3145727
> -- A plain select skips more rows, and the result depends on the number of
> -- blocks configured in the underlying filesystem. 3 rows are skipped when
> -- the file is read in 3 blocks.
> select * from test_table;
> -- .
> -- .
> -- Fetched 3145724 rows
>  {code}
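
The difference between the two code paths described in the issue can be illustrated with a small, self-contained model. This is only a sketch and not Hive's actual code: the class HeaderSkipSketch, the Split type, and both method names are hypothetical, and the only idea taken from the report is that the header should be skipped for the first block of a file only, which is modeled here as a split whose start offset is 0.

```java
import java.util.Arrays;
import java.util.List;

public class HeaderSkipSketch {

    /** Hypothetical stand-in for an input split: its starting byte offset and line count. */
    public static final class Split {
        final long start;
        final int lineCount;

        public Split(long start, int lineCount) {
            this.start = start;
            this.lineCount = lineCount;
        }
    }

    /** Models the reported behavior: headerCount lines are dropped from every split. */
    public static int rowsSkippingEverySplit(List<Split> splits, int headerCount) {
        int rows = 0;
        for (Split s : splits) {
            rows += Math.max(0, s.lineCount - headerCount);
        }
        return rows;
    }

    /** Models the proposed guard: drop the header only from the split that starts at offset 0. */
    public static int rowsSkippingFirstSplitOnly(List<Split> splits, int headerCount) {
        int rows = 0;
        for (Split s : splits) {
            int skip = (s.start == 0) ? headerCount : 0;
            rows += Math.max(0, s.lineCount - skip);
        }
        return rows;
    }

    public static void main(String[] args) {
        // One file of 12 lines (1 header + 11 data rows), read as three splits of 4 lines each.
        // The offsets 400 and 800 are arbitrary non-zero values chosen for illustration.
        List<Split> splits = Arrays.asList(
                new Split(0L, 4), new Split(400L, 4), new Split(800L, 4));
        System.out.println(rowsSkippingEverySplit(splits, 1));     // prints 9
        System.out.println(rowsSkippingFirstSplitOnly(splits, 1)); // prints 11
    }
}
```

With these made-up numbers, skipping per split loses one data row from each split after the first, while guarding on the start offset returns all 11 data rows.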



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
