I tried both ways: I passed the ADL-specific configuration in --hdfsConfiguration, and I also set up core-site.xml/hdfs-site.xml. As I mentioned, it's an HDI + Spark cluster, so those things are already set up. A Spark job (without Beam) is also able to read and write to ADLS on the same machine.
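For illustration, an --hdfsConfiguration value carrying the standard hadoop-azure-datalake properties would look roughly like the sketch below; the account name, client id, credential, and tenant id are placeholders, not the actual values used here:

    --hdfsConfiguration='[{\"fs.defaultFS\": \"adl://ACCOUNT.azuredatalakestore.net\",
        \"fs.adl.impl\": \"org.apache.hadoop.fs.adl.AdlFileSystem\",
        \"fs.AbstractFileSystem.adl.impl\": \"org.apache.hadoop.fs.adl.Adl\",
        \"fs.adl.oauth2.access.token.provider.type\": \"ClientCredential\",
        \"fs.adl.oauth2.client.id\": \"CLIENT_ID\",
        \"fs.adl.oauth2.credential\": \"CLIENT_SECRET\",
        \"fs.adl.oauth2.refresh.url\": \"https://login.microsoftonline.com/TENANT_ID/oauth2/token\"}]'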
BTW, if authentication or understanding ADL were the problem, it would have thrown an error like ADLFileSystem missing, or an access failure, or something similar. Thoughts?

-Milan.

-----Original Message-----
From: Lukasz Cwik [mailto:[email protected]]
Sent: Thursday, November 23, 2017 5:05 AM
To: [email protected]
Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner

In your example it seems as though your HDFS configuration doesn't contain any ADL-specific configuration:
"--hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'"

Do you have a core-site.xml or hdfs-site.xml configured as per https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html?

From the documentation for --hdfsConfiguration:

    A list of Hadoop configurations used to configure zero or more Hadoop
    filesystems. By default, Hadoop configuration is loaded from
    'core-site.xml' and 'hdfs-site.xml' based upon the HADOOP_CONF_DIR and
    YARN_CONF_DIR environment variables. To specify configuration on the
    command line, represent the value as a JSON list of JSON maps, where
    each map represents the entire configuration for a single Hadoop
    filesystem. For example:
    --hdfsConfiguration='[{\"fs.default.name\": \"hdfs://localhost:9998\", ...},{\"fs.default.name\": \"s3a://\", ...},...]'

From: https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45
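For reference, the setup described on that hadoop-azure-datalake page amounts to core-site.xml entries along the following lines. This is a sketch only; the client id, credential, and tenant id are placeholders:

    <configuration>
      <property>
        <name>fs.adl.oauth2.access.token.provider.type</name>
        <value>ClientCredential</value>
      </property>
      <property>
        <name>fs.adl.oauth2.client.id</name>
        <value>CLIENT_ID</value>
      </property>
      <property>
        <name>fs.adl.oauth2.credential</name>
        <value>CLIENT_SECRET</value>
      </property>
      <property>
        <name>fs.adl.oauth2.refresh.url</name>
        <value>https://login.microsoftonline.com/TENANT_ID/oauth2/token</value>
      </property>
    </configuration>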
On Wed, Nov 22, 2017 at 1:12 AM, Jean-Baptiste Onofré <[email protected]> wrote:
> Hi,
>
> FYI, I'm in touch with the Microsoft Azure team about that.
>
> We are testing the ADLS support via HDFS.
>
> I'll keep you posted.
>
> Regards
> JB
>
> On 11/22/2017 09:12 AM, Milan Chandna wrote:
>> Hi,
>>
>> Has anyone tried IO from (to) an ADLS account on Beam with the Spark runner?
>> I was trying recently to do this but was unable to make it work.
>>
>> Steps that I tried:
>>
>> 1. Took an HDI + Spark 1.6 cluster with an ADLS account as the default storage.
>> 2. Built Apache Beam on that, including the BEAM-2790
>>    <https://issues.apache.org/jira/browse/BEAM-2790> fix, which I had
>>    earlier been hitting for ADL as well.
>> 3. Modified the WordCount.java example to use HadoopFileSystemOptions.
>> 4. Since the HDI + Spark cluster has ADLS as the defaultFS, tried two things:
>>    * Just gave the input and output paths as adl://home/sample.txt and
>>      adl://home/output.
>>    * In addition to the adl input and output paths, also gave the required
>>      HDFS configuration, including the adl-specific configs.
>>
>> Neither worked, btw.
>>
>> 1. I have checked ACLs and permissions. In fact, a similar job with the
>>    same paths works on Spark directly.
>> 2. Issues faced:
>>    * For input, Beam is not able to find the path. Console log:
>>      Filepattern adl://home/sample.txt matched 0 files with total size 0
>>    * The output path always gets converted to a relative path, something
>>      like this: /home/user1/adl:/home/output/.tmp....
>>
>> I'm debugging this further, but wanted to check whether someone has
>> already faced this and has a resolution.
>>
>> Here is a sample of the code and command I used:
>>
>> HadoopFileSystemOptions options =
>>     PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
>> Pipeline p = Pipeline.create(options);
>> p.apply(TextIO.read().from(options.getHdfsConfiguration().get(0).get("fs.defaultFS")))
>>  .apply(new CountWords())
>>  .apply(MapElements.via(new FormatAsTextFn()))
>>  .apply(TextIO.write().to("adl://home/output"));
>> p.run().waitUntilFinish();
>>
>> spark-submit --class org.apache.beam.examples.WordCount --master local
>>     beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner
>>     --hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'
>>
>> P.S.: I created a fat jar to use with Spark just for testing. Is there any
>> other correct way of running it with the Spark runner?
>>
>> -Milan.

-- 
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
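As an alternative to passing --hdfsConfiguration on the command line, the same Hadoop configuration can be built programmatically before creating the pipeline. The following is a minimal, untested sketch: the account name, credentials, and paths are placeholders, and it assumes the standard hadoop-azure-datalake properties shown earlier.

    import java.util.Collections;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.hadoop.conf.Configuration;

    public class AdlCopySketch {
      public static void main(String[] args) {
        // Build the ADL-aware Hadoop configuration in code instead of via --hdfsConfiguration.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "adl://ACCOUNT.azuredatalakestore.net");  // placeholder account
        conf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem");
        conf.set("fs.AbstractFileSystem.adl.impl", "org.apache.hadoop.fs.adl.Adl");
        conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential");
        conf.set("fs.adl.oauth2.client.id", "CLIENT_ID");        // placeholder
        conf.set("fs.adl.oauth2.credential", "CLIENT_SECRET");   // placeholder
        conf.set("fs.adl.oauth2.refresh.url",
            "https://login.microsoftonline.com/TENANT_ID/oauth2/token");  // placeholder

        // Hand the configuration to Beam's Hadoop filesystem support via pipeline options.
        HadoopFileSystemOptions options =
            PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
        options.setHdfsConfiguration(Collections.singletonList(conf));

        // A trivial read-then-write pipeline, just to exercise ADL IO on both sides.
        Pipeline p = Pipeline.create(options);
        p.apply(TextIO.read().from("adl://ACCOUNT.azuredatalakestore.net/sample.txt"))
         .apply(TextIO.write().to("adl://ACCOUNT.azuredatalakestore.net/output"));
        p.run().waitUntilFinish();
      }
    }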
