[ https://issues.apache.org/jira/browse/BEAM-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013323#comment-16013323 ]
Jean-Baptiste Onofré commented on BEAM-2110: -------------------------------------------- I saw an issue (that looks a bit similar even if the exception is not the same) with the Spark runner, using {{HDFSFileSource}} with large files. As a workaround (and as {{HDFSFileSource}} will be deprecated soon), I would try with {{TextIO.read()}} with HDFS. Anyway, I'm investigating in the {{HDFSFileSource}}. > HDFSFileSource throws IndexOutOfBoundsException when trying to read big file > (gaming_data1.csv) > ----------------------------------------------------------------------------------------------- > > Key: BEAM-2110 > URL: https://issues.apache.org/jira/browse/BEAM-2110 > Project: Beam > Issue Type: Bug > Components: runner-flink, sdk-java-extensions > Environment: CentOS 7, Oracle JDK 8 > Reporter: Gergely Novák > Assignee: Jean-Baptiste Onofré > > I modified the wordcount example to read from HDFS with this code: > {code} > pipeline.apply(Read.from(HDFSFileSource.fromText(options.getInput()))) > {code} > This worked for a number of small files I tried with. But with the included > example: gs://apache-beam-samples/game/gaming_data*.csv (moved to HDFS) fails > with the following trace: > {noformat} > Caused by: java.lang.IndexOutOfBoundsException > at java.nio.Buffer.checkBounds(Buffer.java:567) > at java.nio.ByteBuffer.get(ByteBuffer.java:686) > at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:285) > at > org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:168) > at > org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:775) > at > org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:831) > at > org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:891) > at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934) > at java.io.DataInputStream.read(DataInputStream.java:149) > at > org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:59) > at > org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216) > at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174) > at > org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:91) > at > org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:144) > at > org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:184) > at > org.apache.beam.sdk.io.hdfs.HDFSFileSource$HDFSFileReader.advance(HDFSFileSource.java:492) > at > org.apache.beam.sdk.io.hdfs.HDFSFileSource$HDFSFileReader.start(HDFSFileSource.java:465) > at > org.apache.beam.runners.flink.metrics.ReaderInvocationUtil.invokeStart(ReaderInvocationUtil.java:50) > at > org.apache.beam.runners.flink.translation.wrappers.SourceInputFormat.open(SourceInputFormat.java:79) > at > org.apache.beam.runners.flink.translation.wrappers.SourceInputFormat.open(SourceInputFormat.java:45) > at > org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:144) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:655) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346)