[
https://issues.apache.org/jira/browse/HIVE-25765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Work on HIVE-25765 started by Ganesha Shreedhara.
-------------------------------------------------
> skip.header.line.count property skips rows of each block in FetchOperator
> when file size is larger
> --------------------------------------------------------------------------------------------------
>
> Key: HIVE-25765
> URL: https://issues.apache.org/jira/browse/HIVE-25765
> Project: Hive
> Issue Type: Bug
> Affects Versions: 3.1.2
> Reporter: Ganesha Shreedhara
> Assignee: Ganesha Shreedhara
> Priority: Major
> Labels: pull-request-available
> Attachments: data.txt.gz
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When the _skip.header.line.count_ property is set in the table properties,
> simple select queries that get converted into a FetchTask skip rows of each
> block instead of skipping only the header lines of each file. This happens
> when the file is large enough to be read in multiple blocks. The issue does
> not occur when the select query is converted into a map-only job by setting
> _hive.fetch.task.conversion_ to _none_, because the header lines are then
> skipped only for the first block, thanks to [this
> check|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HiveContextAwareRecordReader.java#L330].
> We should have a similar check in FetchOperator to avoid this issue.
>
> *Steps to reproduce:*
> {code:java}
> -- Create table on top of the data file (uncompressed size: ~239M) attached
> -- in this ticket
> CREATE EXTERNAL TABLE test_table(
> col1 string,
> col2 string,
> col3 string,
> col4 string,
> col5 string,
> col6 string,
> col7 string,
> col8 string,
> col9 string,
> col10 string,
> col11 string,
> col12 string)
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> STORED AS INPUTFORMAT
> 'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
> 'location_of_data_file'
> TBLPROPERTIES ('skip.header.line.count'='1');
> -- Counting number of rows gives correct result with only one header line
> -- skipped
> select count(*) from test_table;
> 3145727
> -- Select query skips more rows and the result depends upon the number of
> -- blocks configured in underlying filesystem. 3 rows are skipped when the file
> -- is read in 3 blocks.
> select * from test_table;
> .
> .
> Fetched 3145724 rows
> {code}
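
The difference between the two read paths described in the issue can be illustrated with a small standalone sketch. It is not Hive code: `HeaderSkipSketch`, `Split`, and both read methods are hypothetical names, and the condition "the split starts at offset 0" is an assumption about what the linked check in HiveContextAwareRecordReader amounts to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeaderSkipSketch {

    /** Hypothetical stand-in for one block/split of a file: its start offset and its lines. */
    static final class Split {
        final long start;
        final List<String> lines;

        Split(long start, List<String> lines) {
            this.start = start;
            this.lines = lines;
        }
    }

    /** Behavior reported in the issue: header lines are dropped from every split. */
    static List<String> readSkippingPerSplit(List<Split> splits, int headerCount) {
        List<String> rows = new ArrayList<>();
        for (Split s : splits) {
            int skip = Math.min(headerCount, s.lines.size());
            rows.addAll(s.lines.subList(skip, s.lines.size()));
        }
        return rows;
    }

    /** Proposed behavior: drop header lines only from the split starting at offset 0 (assumed condition). */
    static List<String> readSkippingFirstSplitOnly(List<Split> splits, int headerCount) {
        List<String> rows = new ArrayList<>();
        for (Split s : splits) {
            int skip = (s.start == 0) ? Math.min(headerCount, s.lines.size()) : 0;
            rows.addAll(s.lines.subList(skip, s.lines.size()));
        }
        return rows;
    }

    /** One file with a header and six data rows, read as three splits (offsets are made up). */
    static List<Split> sampleSplits() {
        return Arrays.asList(
                new Split(0, Arrays.asList("header", "r1", "r2")),
                new Split(100, Arrays.asList("r3", "r4")),
                new Split(200, Arrays.asList("r5", "r6")));
    }

    public static void main(String[] args) {
        List<Split> splits = sampleSplits();
        // Skipping per split loses the first data row of every later split.
        System.out.println("per-split skip:   " + readSkippingPerSplit(splits, 1).size() + " rows");
        // Skipping only in the first split keeps all six data rows.
        System.out.println("first-split skip: " + readSkippingFirstSplitOnly(splits, 1).size() + " rows");
    }
}
```

With the sample data, the per-split variant returns 4 rows and the first-split-only variant returns all 6 data rows, which mirrors the kind of row loss shown in the repro above.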