In your example it seems your HDFS configuration doesn't contain any ADL-specific settings:

    --hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'

Do you have a core-site.xml or hdfs-site.xml configured as per
https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html ?
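For reference, here is a minimal core-site.xml sketch of the ADL pieces that appear to be missing. Every value below is a placeholder (tenant ID, client ID, secret), not something taken from this thread; the key names come from the hadoop-azure-datalake page linked above, which remains the authoritative source:

```xml
<configuration>
  <!-- Service-to-service OAuth2 authentication for the adl:// filesystem -->
  <property>
    <name>fs.adl.oauth2.access.token.provider.type</name>
    <value>ClientCredential</value>
  </property>
  <property>
    <name>fs.adl.oauth2.refresh.url</name>
    <!-- Placeholder: your Azure AD tenant's token endpoint -->
    <value>https://login.microsoftonline.com/YOUR_TENANT_ID/oauth2/token</value>
  </property>
  <property>
    <name>fs.adl.oauth2.client.id</name>
    <!-- Placeholder: your service principal's application ID -->
    <value>YOUR_CLIENT_ID</value>
  </property>
  <property>
    <name>fs.adl.oauth2.credential</name>
    <!-- Placeholder: your service principal's secret -->
    <value>YOUR_CLIENT_SECRET</value>
  </property>
</configuration>
```

The same keys can alternatively be passed inline through --hdfsConfiguration, as one JSON map per filesystem.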
From the documentation for --hdfsConfiguration:

    A list of Hadoop configurations used to configure zero or more Hadoop
    filesystems. By default, Hadoop configuration is loaded from
    'core-site.xml' and 'hdfs-site.xml' based upon the HADOOP_CONF_DIR and
    YARN_CONF_DIR environment variables. To specify configuration on the
    command-line, represent the value as a JSON list of JSON maps, where
    each map represents the entire configuration for a single Hadoop
    filesystem. For example:
    --hdfsConfiguration='[{\"fs.default.name\": \"hdfs://localhost:9998\", ...},{\"fs.default.name\": \"s3a://\", ...},...]'

From:
https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45

On Wed, Nov 22, 2017 at 1:12 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi,
>
> FYI, I'm in touch with the Microsoft Azure team about that.
>
> We are testing the ADLS support via HDFS.
>
> I'll keep you posted.
>
> Regards
> JB
>
> On 11/22/2017 09:12 AM, Milan Chandna wrote:
>>
>> Hi,
>>
>> Has anyone tried IO from (or to) an ADLS account on Beam with the Spark
>> runner? I tried this recently but was unable to make it work.
>>
>> Steps that I tried:
>>
>> 1. Took an HDI + Spark 1.6 cluster with an ADLS account as the default
>>    storage.
>> 2. Built Apache Beam on it, including the BEAM-2790
>>    <https://issues.apache.org/jira/browse/BEAM-2790> fix, which I had
>>    previously been hitting for ADL as well.
>> 3. Modified the WordCount.java example to use HadoopFileSystemOptions.
>> 4. Since the HDI + Spark cluster has ADLS as the default FS, I tried two
>>    things:
>>    * Just gave the input and output paths as adl://home/sample.txt and
>>      adl://home/output.
>>    * In addition to the adl input and output paths, also gave the
>>      required HDFS configuration, including the ADL-specific settings.
>>
>> Neither worked, by the way.
>>
>> 1. I have checked ACLs and permissions. In fact, a similar job with the
>>    same paths works on Spark directly.
>> 2. Issues faced:
>>    * For input, Beam is not able to find the path. Console log:
>>      Filepattern adl://home/sample.txt matched 0 files with total size 0
>>    * The output path always gets converted to a relative path, something
>>      like this: /home/user1/adl:/home/output/.tmp....
>>
>> I am debugging this further, but wanted to check whether someone is
>> already facing this and has a resolution.
>>
>> Here is a sample of the code and command I used:
>>
>> HadoopFileSystemOptions options =
>>     PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
>>
>> Pipeline p = Pipeline.create(options);
>>
>> p.apply(TextIO.read().from(
>>         options.getHdfsConfiguration().get(0).get("fs.defaultFS")))
>>  .apply(new CountWords())
>>  .apply(MapElements.via(new FormatAsTextFn()))
>>  .apply(TextIO.write().to("adl://home/output"));
>>
>> p.run().waitUntilFinish();
>>
>> spark-submit --class org.apache.beam.examples.WordCount --master local \
>>     beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner \
>>     --hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'
>>
>> P.S.: I created a fat jar to use with Spark just for testing. Is there
>> any other correct way of running it with the Spark runner?
>>
>> -Milan.
>>

> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
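Two separate problems are visible in the spark-submit flag quoted above: the JSON value is malformed (the inner map opened with `{` is never closed with `}` before the `]`), and fs.defaultFS points at a file rather than a filesystem root. A throwaway balance check would have caught the first problem before the job was ever submitted. This is a plain-JDK sketch of mine (the class name HdfsConfCheck is hypothetical, not anything from Beam):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class HdfsConfCheck {

    // Returns true if braces/brackets in the JSON text are balanced,
    // ignoring anything inside double-quoted strings.
    static boolean balanced(String s) {
        Deque<Character> stack = new ArrayDeque<>();
        boolean inString = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inString) {
                if (c == '\\') i++;               // skip the escaped character
                else if (c == '"') inString = false;
                continue;
            }
            if (c == '"') inString = true;
            else if (c == '{' || c == '[') stack.push(c);
            else if (c == '}') { if (stack.isEmpty() || stack.pop() != '{') return false; }
            else if (c == ']') { if (stack.isEmpty() || stack.pop() != '[') return false; }
        }
        return !inString && stack.isEmpty();
    }

    public static void main(String[] args) {
        // The flag value as posted in the thread: '{' is never closed.
        String original  = "[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]";
        // Same value with the closing brace added.
        String corrected = "[{\"fs.defaultFS\": \"adl://home\"}]";
        System.out.println(balanced(original));   // false
        System.out.println(balanced(corrected));  // true
    }
}
```

With the closing brace added and an adl:// root, the flag would take the shape --hdfsConfiguration='[{\"fs.defaultFS\": \"adl://home\"}]', with the ADL OAuth2 keys included in the same JSON map if they are not already in core-site.xml.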