[jira] [Commented] (BEAM-2110) HDFSFileSource throws IndexOutOfBoundsException when trying to read big file (gaming_data1.csv)

2017-05-16 Thread Daniel Halperin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16013330#comment-16013330
 ] 

Daniel Halperin commented on BEAM-2110:
---

{{HDFSFileSource}} has been deleted in version 2.0.0, as all {{FileBasedSource}} 
implementations such as {{TextIO}} and {{AvroIO}} work using {{hdfs://}} paths 
and the {{HadoopFileSystem}}.

I think we can close this bug as obsolete.
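
For reference, a minimal sketch of the 2.0.0-style replacement (assuming the {{beam-sdks-java-io-hadoop-file-system}} module is on the classpath and {{HadoopFileSystemOptions}} is configured so {{hdfs://}} paths resolve; the namenode address and path below are illustrative):

{code:java}
// Hypothetical sketch: reading an HDFS path with TextIO instead of the
// removed HDFSFileSource. The hdfs:// URI is resolved by HadoopFileSystem.
Pipeline pipeline = Pipeline.create(options);
PCollection<String> lines =
    pipeline.apply("ReadFromHdfs",
        TextIO.read().from("hdfs://namenode:8020/data/gaming_data1.csv"));
{code}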

> HDFSFileSource throws IndexOutOfBoundsException when trying to read big file 
> (gaming_data1.csv)
> ---
>
> Key: BEAM-2110
> URL: https://issues.apache.org/jira/browse/BEAM-2110
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink, sdk-java-extensions
> Environment: CentOS 7, Oracle JDK 8
>Reporter: Gergely Novák
>Assignee: Jean-Baptiste Onofré
>
> I modified the wordcount example to read from HDFS with this code:
> {code}
> pipeline.apply(Read.from(HDFSFileSource.fromText(options.getInput())));
> {code}
> This worked for a number of small files I tried with. But with the included 
> example: gs://apache-beam-samples/game/gaming_data*.csv (moved to HDFS) fails 
> with the following trace:
> {noformat}
> Caused by: java.lang.IndexOutOfBoundsException
>   at java.nio.Buffer.checkBounds(Buffer.java:567)
>   at java.nio.ByteBuffer.get(ByteBuffer.java:686)
>   at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:285)
>   at 
> org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:168)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:775)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:831)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:891)
>   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at 
> org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:59)
>   at 
> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
>   at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>   at 
> org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:91)
>   at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:144)
>   at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:184)
>   at 
> org.apache.beam.sdk.io.hdfs.HDFSFileSource$HDFSFileReader.advance(HDFSFileSource.java:492)
>   at 
> org.apache.beam.sdk.io.hdfs.HDFSFileSource$HDFSFileReader.start(HDFSFileSource.java:465)
>   at 
> org.apache.beam.runners.flink.metrics.ReaderInvocationUtil.invokeStart(ReaderInvocationUtil.java:50)
>   at 
> org.apache.beam.runners.flink.translation.wrappers.SourceInputFormat.open(SourceInputFormat.java:79)
>   at 
> org.apache.beam.runners.flink.translation.wrappers.SourceInputFormat.open(SourceInputFormat.java:45)
>   at 
> org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:144)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:655)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-2110) HDFSFileSource throws IndexOutOfBoundsException when trying to read big file (gaming_data1.csv)

2017-05-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/BEAM-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16013323#comment-16013323
 ] 

Jean-Baptiste Onofré commented on BEAM-2110:
---
I saw a similar-looking issue (even if the exception is not the same) with the 
Spark runner, using {{HDFSFileSource}} with large files. As a workaround (and 
as {{HDFSFileSource}} will be deprecated soon), I would try {{TextIO.read()}} 
with HDFS. Anyway, I'm investigating the {{HDFSFileSource}}.



[jira] [Commented] (BEAM-2110) HDFSFileSource throws IndexOutOfBoundsException when trying to read big file (gaming_data1.csv)

2017-04-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/BEAM-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989768#comment-15989768
 ] 

Jean-Baptiste Onofré commented on BEAM-2110:
---
Sure, let me take a look. Thanks!



[jira] [Commented] (BEAM-2110) HDFSFileSource throws IndexOutOfBoundsException when trying to read big file (gaming_data1.csv)

2017-04-28 Thread Davor Bonaci (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989117#comment-15989117
 ] 

Davor Bonaci commented on BEAM-2110:
---
[~jbonofre], any chance you'd be able to help with this exception?
