I tried both ways: I passed the ADL-specific configuration in --hdfsConfiguration, and I also set up core-site.xml/hdfs-site.xml. As I mentioned, it's an HDI + Spark cluster, so those things are already set up. A Spark job (without Beam) is also able to read and write to ADLS on the same machine.
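For illustration, an --hdfsConfiguration value carrying the standard hadoop-azure-datalake properties would look roughly like the sketch below; the account name, client id, credential, and tenant id are placeholders, not the actual values used here:

    --hdfsConfiguration='[{\"fs.defaultFS\": \"adl://ACCOUNT.azuredatalakestore.net\",
        \"fs.adl.impl\": \"org.apache.hadoop.fs.adl.AdlFileSystem\",
        \"fs.AbstractFileSystem.adl.impl\": \"org.apache.hadoop.fs.adl.Adl\",
        \"fs.adl.oauth2.access.token.provider.type\": \"ClientCredential\",
        \"fs.adl.oauth2.client.id\": \"CLIENT_ID\",
        \"fs.adl.oauth2.credential\": \"CLIENT_SECRET\",
        \"fs.adl.oauth2.refresh.url\": \"https://login.microsoftonline.com/TENANT_ID/oauth2/token\"}]'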
BTW, if authentication or understanding ADL were the problem, it would have thrown an error like ADLFileSystem missing, or an access failure, or something similar. Thoughts?

-Milan.

-----Original Message-----
From: Lukasz Cwik [mailto:[email protected]]
Sent: Thursday, November 23, 2017 5:05 AM
To: [email protected]
Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner

In your example it seems as though your HDFS configuration doesn't contain any ADL-specific configuration:
"--hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'"

Do you have a core-site.xml or hdfs-site.xml configured as per https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html?

From the documentation for --hdfsConfiguration:

    A list of Hadoop configurations used to configure zero or more Hadoop
    filesystems. By default, Hadoop configuration is loaded from
    'core-site.xml' and 'hdfs-site.xml' based upon the HADOOP_CONF_DIR and
    YARN_CONF_DIR environment variables. To specify configuration on the
    command line, represent the value as a JSON list of JSON maps, where
    each map represents the entire configuration for a single Hadoop
    filesystem. For example:
    --hdfsConfiguration='[{\"fs.default.name\": \"hdfs://localhost:9998\", ...},{\"fs.default.name\": \"s3a://\", ...},...]'

From: https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45
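For reference, the setup described on that hadoop-azure-datalake page amounts to core-site.xml entries along the following lines. This is a sketch only; the client id, credential, and tenant id are placeholders:

    <configuration>
      <property>
        <name>fs.adl.oauth2.access.token.provider.type</name>
        <value>ClientCredential</value>
      </property>
      <property>
        <name>fs.adl.oauth2.client.id</name>
        <value>CLIENT_ID</value>
      </property>
      <property>
        <name>fs.adl.oauth2.credential</name>
        <value>CLIENT_SECRET</value>
      </property>
      <property>
        <name>fs.adl.oauth2.refresh.url</name>
        <value>https://login.microsoftonline.com/TENANT_ID/oauth2/token</value>
      </property>
    </configuration>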
On Wed, Nov 22, 2017 at 1:12 AM, Jean-Baptiste Onofré <[email protected]> wrote:
> Hi,
>
> FYI, I'm in touch with the Microsoft Azure team about that.
>
> We are testing the ADLS support via HDFS.
>
> I'll keep you posted.
>
> Regards
> JB
>
> On 11/22/2017 09:12 AM, Milan Chandna wrote:
>> Hi,
>>
>> Has anyone tried IO from (to) an ADLS account on Beam with the Spark runner?
>> I was trying recently to do this but was unable to make it work.
>>
>> Steps that I tried:
>>
>> 1. Took an HDI + Spark 1.6 cluster with an ADLS account as the default storage.
>> 2. Built Apache Beam on that, including the BEAM-2790
>>    <https://issues.apache.org/jira/browse/BEAM-2790> fix, which I had
>>    earlier been hitting for ADL as well.
>> 3. Modified the WordCount.java example to use HadoopFileSystemOptions.
>> 4. Since the HDI + Spark cluster has ADLS as the defaultFS, tried two things:
>>    * Just gave the input and output paths as adl://home/sample.txt and
>>      adl://home/output.
>>    * In addition to the adl input and output paths, also gave the required
>>      HDFS configuration, including the adl-specific configs.
>>
>> Neither worked, btw.
>>
>> 1. I have checked ACLs and permissions. In fact, a similar job with the
>>    same paths works on Spark directly.
>> 2. Issues faced:
>>    * For input, Beam is not able to find the path. Console log:
>>      Filepattern adl://home/sample.txt matched 0 files with total size 0
>>    * The output path always gets converted to a relative path, something
>>      like this: /home/user1/adl:/home/output/.tmp....
>>
>> I'm debugging this further, but wanted to check whether someone has
>> already faced this and has a resolution.
>>
>> Here is a sample of the code and command I used:
>>
>> HadoopFileSystemOptions options =
>>     PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
>> Pipeline p = Pipeline.create(options);
>> p.apply(TextIO.read().from(options.getHdfsConfiguration().get(0).get("fs.defaultFS")))
>>  .apply(new CountWords())
>>  .apply(MapElements.via(new FormatAsTextFn()))
>>  .apply(TextIO.write().to("adl://home/output"));
>> p.run().waitUntilFinish();
>>
>> spark-submit --class org.apache.beam.examples.WordCount --master local
>>     beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner
>>     --hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'
>>
>> P.S.: I created a fat jar to use with Spark just for testing. Is there any
>> other correct way of running it with the Spark runner?
>>
>> -Milan.

-- 
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
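As an alternative to passing --hdfsConfiguration on the command line, the same Hadoop configuration can be built programmatically before creating the pipeline. The following is a minimal, untested sketch: the account name, credentials, and paths are placeholders, and it assumes the standard hadoop-azure-datalake properties shown earlier.

    import java.util.Collections;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.hadoop.conf.Configuration;

    public class AdlCopySketch {
      public static void main(String[] args) {
        // Build the ADL-aware Hadoop configuration in code instead of via --hdfsConfiguration.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "adl://ACCOUNT.azuredatalakestore.net");  // placeholder account
        conf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem");
        conf.set("fs.AbstractFileSystem.adl.impl", "org.apache.hadoop.fs.adl.Adl");
        conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential");
        conf.set("fs.adl.oauth2.client.id", "CLIENT_ID");        // placeholder
        conf.set("fs.adl.oauth2.credential", "CLIENT_SECRET");   // placeholder
        conf.set("fs.adl.oauth2.refresh.url",
            "https://login.microsoftonline.com/TENANT_ID/oauth2/token");  // placeholder

        // Hand the configuration to Beam's Hadoop filesystem support via pipeline options.
        HadoopFileSystemOptions options =
            PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
        options.setHdfsConfiguration(Collections.singletonList(conf));

        // A trivial read-then-write pipeline, just to exercise ADL IO on both sides.
        Pipeline p = Pipeline.create(options);
        p.apply(TextIO.read().from("adl://ACCOUNT.azuredatalakestore.net/sample.txt"))
         .apply(TextIO.write().to("adl://ACCOUNT.azuredatalakestore.net/output"));
        p.run().waitUntilFinish();
      }
    }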
