Hi,

FYI, I'm in touch with the Microsoft Azure team about this.

We are testing the ADLS support via HDFS.

I'll keep you posted.

Regards
JB

On 11/22/2017 09:12 AM, Milan Chandna wrote:
Hi,

Has anyone tried IO from/to an ADLS account on Beam with the Spark runner?
I tried this recently but was unable to make it work.

Steps that I tried:

   1.  Took an HDI + Spark 1.6 cluster with an ADLS account as the default storage.
   2.  Built Apache Beam on it, including the BEAM-2790 <https://issues.apache.org/jira/browse/BEAM-2790> fix for an issue I had earlier hit with ADL as well.
   3.  Modified the WordCount.java example to use HadoopFileSystemOptions.
   4.  Since the HDI + Spark cluster has ADLS as the defaultFS, I tried two things:
      *   Just gave the input and output paths as adl://home/sample.txt and adl://home/output.
      *   In addition to the adl input and output paths, also passed the required HDFS configuration including the ADL-specific configs (see the sketch after this list).
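
For reference, the ADL-specific configuration I mean is roughly the following. This is a sketch only, assuming the standard hadoop-azure-datalake ClientCredential keys; the client id, secret, and tenant values are placeholders:

     import java.util.Collections;
     import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
     import org.apache.beam.sdk.options.PipelineOptionsFactory;
     import org.apache.hadoop.conf.Configuration;

     Configuration conf = new Configuration();
     // Register the adl:// scheme (needs hadoop-azure-datalake on the classpath).
     conf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem");
     conf.set("fs.AbstractFileSystem.adl.impl", "org.apache.hadoop.fs.adl.Adl");
     // Service principal credentials; all values here are placeholders.
     conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential");
     conf.set("dfs.adls.oauth2.client.id", "<application-id>");
     conf.set("dfs.adls.oauth2.credential", "<application-secret>");
     conf.set("dfs.adls.oauth2.refresh.url",
         "https://login.microsoftonline.com/<tenant-id>/oauth2/token");

     // Hand the configuration to Beam's Hadoop filesystem registrar.
     HadoopFileSystemOptions options =
         PipelineOptionsFactory.as(HadoopFileSystemOptions.class);
     options.setHdfsConfiguration(Collections.singletonList(conf));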

Neither attempt worked, by the way.
   1.  I have checked ACLs and permissions; in fact, a similar job with the same paths works on Spark directly.
   2.  Issues faced:
      *   For input, Beam is not able to find the path. Console log: Filepattern adl://home/sample.txt matched 0 files with total size 0
      *   The output path always gets converted to a relative path, something like this: /home/user1/adl:/home/output/.tmp....

I am debugging this further, but wanted to check whether someone is already facing this and has a resolution.

Here are the sample code and command I used:

     HadoopFileSystemOptions options =
         PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);

     Pipeline p = Pipeline.create(options);

     // Read the input path from the fs.defaultFS entry of the first
     // hdfsConfiguration map (set via --hdfsConfiguration below).
     p.apply(TextIO.read().from(options.getHdfsConfiguration().get(0).get("fs.defaultFS")))
      .apply(new CountWords())
      .apply(MapElements.via(new FormatAsTextFn()))
      .apply(TextIO.write().to("adl://home/output"));

     p.run().waitUntilFinish();
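
For completeness, the first variant in step 4 passed the adl paths directly; the pipeline body then looked roughly like this (a sketch, reusing the CountWords and FormatAsTextFn transforms from the stock WordCount example):

     p.apply(TextIO.read().from("adl://home/sample.txt"))
      .apply(new CountWords())
      .apply(MapElements.via(new FormatAsTextFn()))
      .apply(TextIO.write().to("adl://home/output"));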

spark-submit --class org.apache.beam.examples.WordCount --master local \
  beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner \
  --hdfsConfiguration='[{"fs.defaultFS": "hdfs://home/sample.txt"}]'
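
In the ADL runs, the same flag carried the adl path instead (and, for the second variant, the ADL configs from the sketch above). Roughly, as a sketch rather than the exact command:

spark-submit --class org.apache.beam.examples.WordCount --master local \
  beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner \
  --hdfsConfiguration='[{"fs.defaultFS": "adl://home/sample.txt",
    "dfs.adls.oauth2.access.token.provider.type": "ClientCredential"}]'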

P.S.: I created a fat jar to use with Spark just for testing. Is there another, more correct way of running it with the Spark runner?

-Milan.


--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
