Sergey Shelukhin created MAPREDUCE-6635:
-------------------------------------------
             Summary: Unsafe long to int conversion in UncompressedSplitLineReader and IndexOutOfBoundsException
                 Key: MAPREDUCE-6635
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6635
             Project: Hadoop Map/Reduce
          Issue Type: Bug
            Reporter: Sergey Shelukhin

LineRecordReader creates the unsplittable reader like so:
{noformat}
in = new UncompressedSplitLineReader(
    fileIn, job, recordDelimiter, split.getLength());
{noformat}
The split length goes into:
{noformat}
private long splitLength;
{noformat}
At some point while reading the first line, fillBuffer does this:
{noformat}
@Override
protected int fillBuffer(InputStream in, byte[] buffer, boolean inDelimiter)
    throws IOException {
  int maxBytesToRead = buffer.length;
  if (totalBytesRead < splitLength) {
    maxBytesToRead = Math.min(maxBytesToRead,
        (int)(splitLength - totalBytesRead));
{noformat}
For splits larger than Integer.MAX_VALUE bytes, the cast wraps around and can produce a negative number, and the subsequent DFS read fails a boundary check. This has been reported here: https://issues.streamsets.com/browse/SDC-2229. It also happens in Hive if very large text files are forced to be read in a single split (e.g. via the header-skipping feature, or via set mapred.min.split.size=9999999999999999;).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
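The overflow described above can be sketched as follows. This is a minimal illustration, not Hadoop's actual code or patch; the helper names and the 3 GB split length are assumptions chosen so the wrapped value is visibly negative. One possible fix is to take the min in long arithmetic first, so the result is already bounded by the int-sized buffer length before the cast:

```java
public class SplitLengthOverflowDemo {
    // Buggy pattern from the report: the long subtraction is cast to int
    // *before* Math.min, so a remaining length past Integer.MAX_VALUE
    // wraps around (and here becomes negative).
    static int buggyMaxBytesToRead(int bufferLength, long splitLength,
                                   long totalBytesRead) {
        return Math.min(bufferLength, (int) (splitLength - totalBytesRead));
    }

    // Possible fix (illustrative): min in long arithmetic first; the result
    // is bounded by bufferLength, an int, so the narrowing cast is then safe.
    static int safeMaxBytesToRead(int bufferLength, long splitLength,
                                  long totalBytesRead) {
        return (int) Math.min((long) bufferLength, splitLength - totalBytesRead);
    }

    public static void main(String[] args) {
        long hugeSplit = 3_000_000_000L; // ~3 GB, larger than Integer.MAX_VALUE
        int buf = 64 * 1024;             // typical io buffer size

        // (int) 3_000_000_000L wraps to -1294967296, so the buggy version
        // returns a negative read limit.
        System.out.println(buggyMaxBytesToRead(buf, hugeSplit, 0L)); // -1294967296
        System.out.println(safeMaxBytesToRead(buf, hugeSplit, 0L));  // 65536
    }
}
```

Note that not every oversized split yields a negative value (the cast keeps only the low 32 bits), but any wrapped value smaller than the intended byte count can trip the downstream boundary check.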