[ https://issues.apache.org/jira/browse/BEAM-3945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132258#comment-17132258 ]

Beam JIRA Bot commented on BEAM-3945:
-------------------------------------

This issue is assigned but has not received an update in 30 days so it has been 
labeled "stale-assigned". If you are still working on the issue, please give an 
update and remove the label. If you are no longer working on the issue, please 
unassign so someone else may work on it. In 7 days the issue will be 
automatically unassigned.

> TFRecord Performance Tests don't work on HDFS
> ---------------------------------------------
>
>                 Key: BEAM-3945
>                 URL: https://issues.apache.org/jira/browse/BEAM-3945
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-tfrecord, testing
>            Reporter: Kamil Szewczyk
>            Assignee: Udi Meiri
>            Priority: P3
>              Labels: stale-assigned
>
> TFRecordIO has an issue reading files from HDFS when using a filename pattern 
> such as _"hdfs://...*"_:
> {code:java}
> TFRecordIO.read().from(filenamePattern).withCompression(AUTO){code}
> [link to 
> github|https://github.com/apache/beam/blob/36257aba9054e664ebaafccfefb78bf54a162618/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/tfrecord/TFRecordIOIT.java#L113]
>  This is a blocker for running the full set of file-based IO tests on HDFS.
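> A minimal, self-contained sketch of the failing read (the pipeline 
> scaffolding and class name are illustrative, not the actual IT code; reading 
> from hdfs:// also requires the Hadoop filesystem module on the classpath and 
> the --hdfsConfiguration option shown in step 5 below):
> {code:java}
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.io.Compression;
> import org.apache.beam.sdk.io.TFRecordIO;
> import org.apache.beam.sdk.options.PipelineOptions;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
> import org.apache.beam.sdk.transforms.Count;
> import org.apache.beam.sdk.values.PCollection;
>
> public class TFRecordHdfsReadSketch {
>   public static void main(String[] args) {
>     PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
>     Pipeline p = Pipeline.create(options);
>
>     // Same shape as the IT read: a glob over HDFS with AUTO compression.
>     PCollection<byte[]> records =
>         p.apply(TFRecordIO.read()
>             .from("hdfs://hadoop-xxxxx:9000/TFRecord*")
>             .withCompression(Compression.AUTO));
>
>     // Force the read to execute; this is where "Invalid data" surfaces.
>     records.apply(Count.globally());
>     p.run().waitUntilFinish();
>   }
> }
> {code}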
> Steps to reproduce:
>  1. Create a remote Hadoop environment. This step assumes you have the local 
> kubectl tool configured to use your GCP project.
> {code}
> pushd .test-infra/kubernetes/hadoop/SmallITCluster/ && /bin/bash 
> ./setup-all.sh && popd
> {code}
> 2. Update the /etc/hosts file with the output provided by the script.
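> For illustration, the resulting entry looks something like this (the IP below 
> is made up; the setup script prints the real one):
> {code}
> 35.232.10.11  hadoop-xxxxx
> {code}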
> 3. Confirm that it works and that the Hadoop web interface is accessible at 
> {color:#FF0000}http://hadoop-xxxxx:50070{color}, where xxxxx is the suffix from 
> the /etc/hosts entry added in step 2. Substitute xxxxx the same way wherever it 
> appears below.
>  4. Tell the runner to use root as the Hadoop user.
> {code}
> export HADOOP_USER_NAME=root
> {code}
> 5. Run the TFRecord tests on this environment using the DirectRunner:
> {code}
> mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests/ 
> -Dit.test=org.apache.beam.sdk.io.tfrecord.TFRecordIOIT -Dfilesystem=hdfs 
> -DintegrationTestPipelineOptions='["--filenamePrefix=hdfs://hadoop-xxxxx:9000/TFRecord",
>  "--hdfsConfiguration=[{\"fs.defaultFS\" : \"hdfs://hadoop-xxxxx:9000\", 
> \"dfs.replication\": 1, \"dfs.client.use.datanode.hostname\":\"true\"}]" ]' 
> -DforceDirectRunner=true 
> {code}
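> For readability, the escaped --hdfsConfiguration value above is the following 
> JSON:
> {code}
> [{"fs.defaultFS": "hdfs://hadoop-xxxxx:9000",
>   "dfs.replication": 1,
>   "dfs.client.use.datanode.hostname": "true"}]
> {code}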
> The error message is:
> {code}
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 78.055 s <<< FAILURE! - in org.apache.beam.sdk.io.tfrecord.TFRecordIOIT
> [ERROR] writeThenReadAll(org.apache.beam.sdk.io.tfrecord.TFRecordIOIT)  Time 
> elapsed: 78.055 s  <<< ERROR!
> java.lang.IllegalStateException: Invalid data
>         at 
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:444)
>         at 
> org.apache.beam.sdk.io.TFRecordIO$TFRecordCodec.read(TFRecordIO.java:642)
>         at 
> org.apache.beam.sdk.io.TFRecordIO$TFRecordSource$TFRecordReader.readNextRecord(TFRecordIO.java:526)
>         at 
> org.apache.beam.sdk.io.CompressedSource$CompressedReader.readNextRecord(CompressedSource.java:426)
>         at 
> org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.advanceImpl(FileBasedSource.java:473)
>         at 
> org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.advance(OffsetBasedSource.java:267)
>         at 
> org.apache.beam.runners.direct.BoundedReadEvaluatorFactory$BoundedReadEvaluator.processElement(BoundedReadEvaluatorFactory.java:148)
>         at 
> org.apache.beam.runners.direct.DirectTransformExecutor.processElements(DirectTransformExecutor.java:161)
>         at 
> org.apache.beam.runners.direct.DirectTransformExecutor.run(DirectTransformExecutor.java:125)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
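> One plausible explanation (an assumption, not a confirmed diagnosis): the 
> failing checkState fires when a fixed-size TFRecord header does not parse, and 
> ReadableByteChannel.read() is allowed to fill a buffer only partially, which 
> remote filesystems such as HDFS do far more often than local files. A codec 
> that treats a single read() as a full header read would then see garbage on 
> HDFS while working locally, which matches the note below. A sketch of a 
> partial-read-tolerant loop (hypothetical helper, not Beam code):
> {code:java}
> import java.io.EOFException;
> import java.io.IOException;
> import java.nio.ByteBuffer;
> import java.nio.channels.ReadableByteChannel;
>
> final class ReadFully {
>   // Keep reading until the buffer is full: a single read() may return fewer
>   // bytes than requested, so fixed-size headers must be read in a loop.
>   static void readFully(ReadableByteChannel in, ByteBuffer buf) throws IOException {
>     while (buf.hasRemaining()) {
>       if (in.read(buf) == -1) {
>         throw new EOFException("EOF in the middle of a TFRecord header");
>       }
>     }
>   }
> }
> {code}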
> These results were also observed when running the tests on Jenkins. [Link to 
> jenkins 
> build|https://builds.apache.org/view/A-D/view/Beam/job/beam_PerformanceTests_TFRecordIOIT_HDFS/3/console]
> When you open http://hadoop-xxxxx:50070/explorer.html#/ you will see the 
> TFRecord files that were created during the write phase; they cannot be 
> processed in the read phase.
> {color:red}Important note{color}: if I copy the files produced by the write 
> pipeline from the HDFS directory to a local directory and run the read 
> pipeline over them, everything works fine, so only reading from HDFS is a 
> problem.
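> For reference, one way to copy the written files out of HDFS for that local 
> check (paths are illustrative):
> {code}
> hdfs dfs -get hdfs://hadoop-xxxxx:9000/TFRecord* /tmp/
> {code}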
> You can tear down the HDFS environment by running:
> {code}
> pushd .test-infra/kubernetes/hadoop/SmallITCluster/ && /bin/bash 
> ./teardown-all.sh && popd
> {code}


