Sergey Shelukhin created MAPREDUCE-6635:
-------------------------------------------
Summary: Unsafe long to int conversion in
UncompressedSplitLineReader and IndexOutOfBoundsException
Key: MAPREDUCE-6635
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6635
Project: Hadoop Map/Reduce
Issue Type: Bug
Reporter: Sergey Shelukhin
LineRecordReader creates the uncompressed-split reader like so:
{noformat}
in = new UncompressedSplitLineReader(
    fileIn, job, recordDelimiter, split.getLength());
{noformat}
Split length goes to
{noformat}
private long splitLength;
{noformat}
At some point when reading the first line, fillBuffer does this:
{noformat}
@Override
protected int fillBuffer(InputStream in, byte[] buffer, boolean inDelimiter)
    throws IOException {
  int maxBytesToRead = buffer.length;
  if (totalBytesRead < splitLength) {
    maxBytesToRead = Math.min(maxBytesToRead,
        (int)(splitLength - totalBytesRead));
{noformat}
When the remaining split length exceeds Integer.MAX_VALUE, the narrowing (int) cast can produce a negative maxBytesToRead, and the subsequent DFS read then fails a boundary check with an IndexOutOfBoundsException.
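A standalone sketch of the overflow (names are illustrative, not the actual Hadoop code), together with one possible fix that takes the min in long arithmetic before casting:

```java
// Sketch of the narrowing-cast bug in fillBuffer and a possible fix.
public class CastOverflowDemo {

    // Mirrors the reported pattern: the long remainder is cast to int
    // before the min is taken, so when the low 32 bits of the remainder
    // have the sign bit set, the result goes negative.
    static int buggyMaxBytesToRead(int bufferLen, long splitLength,
            long totalBytesRead) {
        int maxBytesToRead = bufferLen;
        if (totalBytesRead < splitLength) {
            maxBytesToRead = Math.min(maxBytesToRead,
                (int) (splitLength - totalBytesRead)); // narrowing cast overflows
        }
        return maxBytesToRead;
    }

    // Possible fix: compare in long arithmetic first; the result is then
    // bounded above by bufferLen, so the final cast is always safe.
    static int fixedMaxBytesToRead(int bufferLen, long splitLength,
            long totalBytesRead) {
        int maxBytesToRead = bufferLen;
        if (totalBytesRead < splitLength) {
            maxBytesToRead = (int) Math.min((long) maxBytesToRead,
                splitLength - totalBytesRead);
        }
        return maxBytesToRead;
    }

    public static void main(String[] args) {
        long splitLength = 3_000_000_000L; // a ~3 GB forced single split
        // (int) 3_000_000_000L wraps to -1294967296, so the buggy version
        // returns a negative read length while the fixed one stays at 65536.
        System.out.println("buggy: " + buggyMaxBytesToRead(65536, splitLength, 0L));
        System.out.println("fixed: " + fixedMaxBytesToRead(65536, splitLength, 0L));
    }
}
```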
This has been reported here: https://issues.streamsets.com/browse/SDC-2229. It
also happens in Hive when very large text files are forced into a single split
(e.g. via the header-skipping feature, or via set
mapred.min.split.size=9999999999999999;).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)