Hello, I'm testing some pipelines on a dataproc cluster with hadoop version 2.8.2, beam 2.3.0-SNAPSHOT. I have observed on our pipeline as well as the wordcount that ships with beam, that FileBasedSource does not "match" any files when using hdfs prefix - verified this with apex runner and direct runner. Local fs and GoogleHadoopFileSystem work fine. HDFS files access is verified from all worker nodes for all users from cli.
In the logs (console for direct runner, apex.log from one of the containers for apex runner): INFO org.apache.beam.sdk.io.FileBasedSource: Matched 0 files for pattern hdfs:///tmp/input/ Tried numerous versions of the same uri. For example: INFO org.apache.beam.sdk.io.FileBasedSource: Matched 0 files for pattern hdfs://cluster-m/tmp/input/twitter.avro INFO org.apache.beam.sdk.io.FileBasedSource: Matched 0 files for pattern hdfs://mycluster-m/tmp/input/twitter.avro Works for gcs files: INFO org.apache.beam.sdk.io.FileBasedSource: Matched 1 files for pattern gs://mybucket/input/twitter/twitter.avro To reproduce, use beam examples archetype, package and execute: mvn archetype:generate -DarchetypeRepository= https://repository.apache.org/content/groups/snapshots -DarchetypeGroupId=org.apache.beam -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples -DarchetypeVersion=LATEST -DgroupId=org.example -DartifactId=word-count-beam -Dversion="0.1" -Dpackage=org.apache.beam.examples -DinteractiveMode=false cd word-count-beam mvn clean package -Papex-runner -DskipTests yarn jar target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount --inputFile=hdfs:///tmp/input/pom.xml --output=/tmp/output --runner=ApexRunner --embeddedExecution=false Note: "mvn compile exec:java ..." would not work for me due to classpath/version-compat issues. Also needed to exclude org.apache.hadoop:* and com.google.cloud.bigdataoss:* from shaded jar for version compat. Appreciate any help. Regards, Shashank
