[ https://issues.apache.org/jira/browse/HIVE-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859141#comment-15859141 ]
Charles Bernard commented on HIVE-12718:
----------------------------------------
We are seeing the same issue on CDH 5.8.0.
In our case the wrong line (not the last one) is being skipped, and forcing a
single mapper does not help.
> skip.footer.line.count misbehaves on larger text files
> ------------------------------------------------------
>
> Key: HIVE-12718
> URL: https://issues.apache.org/jira/browse/HIVE-12718
> Project: Hive
> Issue Type: Bug
> Affects Versions: 1.1.0
> Environment: The bug was discovered and reproduced on a Cloudera
> Hadoop 5.4 distribution running on CentOS 6.4.
> Reporter: Gergely Nagy
> Priority: Minor
>
> We noticed that when working on a table backed by a larger (large enough to
> require splitting) text file, the {{skip.footer.line.count}} property of the
> table misbehaves: the footer is not being ignored.
> To reproduce, follow these steps:
> 1) Create a large file:
> {{for i in $(seq 1 100); do cat /usr/share/dict/words; done > large.txt}}
> 2) Upload it to HDFS (e.g., as {{/tmp/words}})
> 3) Create an external table with {{skip.footer.line.count}} set:
> {quote}
> CREATE EXTERNAL TABLE ext_words (word STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE LOCATION '/tmp/words'
> tblproperties("skip.header.line.count"="1", "skip.footer.line.count"="1");
> {quote}
> 4) Count the number of times the last line (in this example, I assume that to
> be {{ZZZ}}) appears: {{SELECT COUNT( * ) FROM ext_words WHERE word = 'ZZZ';}}
> 5) Observe that it returns 100 instead of 99.
> Investigation showed that this happens when more than one mapper is used
> for the job. When we increased the split size to force a single mapper,
> the problem did not occur.
> There may be other related issues as well, such as the wrong line being
> skipped, but we have not reproduced those yet.
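
For reference, the expected result in step 5 can be checked locally, without Hive or HDFS. The sketch below is only an illustration under stated assumptions: the function name {{expected_count}} is hypothetical, and a four-line stand-in word list ending in {{ZZZ}} replaces {{/usr/share/dict/words}}. It emulates {{skip.header.line.count=1}} and {{skip.footer.line.count=1}} by dropping the first and last line of the whole concatenated file, which is what a correct implementation should be equivalent to.

```shell
# expected_count N  (hypothetical helper, not part of Hive)
# Builds N copies of a stand-in word list, drops the first and the last
# line of the concatenated file, and prints how many lines equal ZZZ.
expected_count() {
    copies=$1
    workdir=$(mktemp -d)

    # Stand-in dictionary (assumption); its last line is ZZZ, as in step 4.
    printf '%s\n' apple banana cherry ZZZ > "$workdir/words"

    # Same loop as step 1 of the reproduction.
    for i in $(seq 1 "$copies"); do cat "$workdir/words"; done > "$workdir/large.txt"

    # '1d' removes the header line, '$d' removes the footer line;
    # grep -cx counts lines that are exactly ZZZ.
    sed '1d;$d' "$workdir/large.txt" | grep -cx 'ZZZ'

    rm -rf "$workdir"
}

expected_count 100   # prints 99
```

With 100 copies the file contains ZZZ 100 times and only the very last occurrence is the footer, so a query over the table should return 99. A result of 100 means the footer was not skipped.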
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)