Hi JB,

Thanks for the updates.
BTW, I am at Microsoft myself, but I am trying this out of my own interest.
And it's good to know that someone else is also working on this.

-Milan.

-----Original Message-----
From: Jean-Baptiste Onofré [mailto:[email protected]] 
Sent: Thursday, November 23, 2017 1:47 PM
To: [email protected]
Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner

The Azure guys tried to use ADLS via the Beam HDFS filesystem, but it seems
they didn't succeed.
The new approach we plan is to use the ADLS API directly.
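For reference, going through the ADLS API directly means the azure-data-lake-store-sdk rather than the Hadoop adl:// connector. A minimal sketch of that client usage, assuming the com.microsoft.azure.datalake.store SDK is on the classpath; the tenant, account, credentials, and paths below are all placeholders:

```java
import com.microsoft.azure.datalake.store.ADLStoreClient;
import com.microsoft.azure.datalake.store.IfExists;
import com.microsoft.azure.datalake.store.oauth2.AccessTokenProvider;
import com.microsoft.azure.datalake.store.oauth2.ClientCredsTokenProvider;

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class AdlsDirectSketch {
    public static void main(String[] args) throws Exception {
        // OAuth2 client-credential flow against Azure AD (placeholder values).
        AccessTokenProvider tokenProvider = new ClientCredsTokenProvider(
                "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
                "<client-id>", "<client-secret>");

        // Client bound to the store account; no Hadoop Configuration involved.
        ADLStoreClient client = ADLStoreClient.createClient(
                "<account>.azuredatalakestore.net", tokenProvider);

        // Write then read a file, the two operations a Beam FileSystem needs most.
        try (OutputStream out = client.createFile("/beam/sample.txt", IfExists.OVERWRITE)) {
            out.write("hello adls".getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = client.getReadStream("/beam/sample.txt")) {
            // consume the stream ...
        }
    }
}
```

The trade-off versus the HDFS route is that Beam would need its own FileSystem implementation wrapped around this client, instead of reusing beam-sdks-java-io-hadoop-file-system.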

I keep you posted.

Regards
JB

On 11/23/2017 07:42 AM, Milan Chandna wrote:
> I tried both ways.
> I passed the ADL-specific configuration in --hdfsConfiguration and also set
> up core-site.xml/hdfs-site.xml.
> As I mentioned, it's an HDI + Spark cluster, so those things are already set up.
> A Spark job (without Beam) is also able to read and write to ADLS on the same
> machine.
> 
> BTW, if authentication or understanding ADL were the problem, it would have
> thrown an error like ADLFileSystem missing, or probably access failed, or
> something similar. Thoughts?
> 
> -Milan.
> 
> -----Original Message-----
> From: Lukasz Cwik [mailto:[email protected]]
> Sent: Thursday, November 23, 2017 5:05 AM
> To: [email protected]
> Subject: Re: Azure(ADLS) compatibility on Beam with Spark runner
> 
> In your example it seems as though your HDFS configuration doesn't contain 
> any ADL specific configuration:  "--hdfsConfiguration='[{\"fs.defaultFS\":
> \"hdfs://home/sample.txt\"]'"
> Do you have a core-site.xml or hdfs-site.xml configured as per:
> https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html ?
> 
>  From the documentation for --hdfsConfiguration:
> A list of Hadoop configurations used to configure zero or more Hadoop 
> filesystems. By default, Hadoop configuration is loaded from 'core-site.xml'
> and 'hdfs-site.xml' based upon the HADOOP_CONF_DIR and YARN_CONF_DIR
> environment variables. To specify configuration on the command-line, 
> represent the value as a JSON list of JSON maps, where each map represents 
> the entire configuration for a single Hadoop filesystem. For example 
> --hdfsConfiguration='[{\"fs.default.name\":
> \"hdfs://localhost:9998\", ...},{\"fs.default.name\": \"s3a://\", ...},...]'
> From:
> https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45
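For what it's worth, judging from the hadoop-azure-datalake documentation linked above, an ADL-capable map in that JSON list would need at least the adl filesystem binding plus the OAuth2 credential keys, roughly like this (all values are placeholders):

```json
[{
  "fs.defaultFS": "adl://<account>.azuredatalakestore.net/",
  "fs.adl.impl": "org.apache.hadoop.fs.adl.AdlFileSystem",
  "fs.AbstractFileSystem.adl.impl": "org.apache.hadoop.fs.adl.Adl",
  "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
  "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
  "dfs.adls.oauth2.client.id": "<client-id>",
  "dfs.adls.oauth2.credential": "<client-secret>"
}]
```

On a spark-submit command line the inner quotes would need escaping (\") as in Milan's example, and hadoop-azure-datalake (plus its azure-data-lake-store-sdk dependency) must be on the classpath.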
> 
> On Wed, Nov 22, 2017 at 1:12 AM, Jean-Baptiste Onofré 
> <[email protected]>
> wrote:
> 
>> Hi,
>>
>> FYI, I'm in touch with Microsoft Azure team about that.
>>
>> We are testing the ADLS support via HDFS.
>>
>> I keep you posted.
>>
>> Regards
>> JB
>>
>> On 11/22/2017 09:12 AM, Milan Chandna wrote:
>>
>>> Hi,
>>>
>>> Has anyone tried IO from(to) ADLS account on Beam with Spark runner?
>>> I was trying recently to do this but was unable to make it work.
>>>
>>> Steps that I tried:
>>>
>>>     1.  Took HDI + Spark 1.6 cluster with default storage as ADLS account.
>>>     2.  Built Apache Beam on that. Built to include Beam-2790< 
>>> https://issues.apache.org/jira/browse/BEAM-2790> fix, which I was earlier
>>> facing for ADL as well.
>>>     3.  Modified the WordCount.java example to use HadoopFileSystemOptions.
>>>     4.  Since the HDI + Spark cluster has ADLS as the defaultFS, tried 2 things:
>>>        *   Just gave the input and output paths as
>>> adl://home/sample.txt and adl://home/output.
>>>        *   In addition to the adl input and output paths, also gave the
>>> required HDFS configuration with the adl-specific configs as well.
>>>
>>> Neither worked, btw.
>>>
>>>     1.  Have checked ACLs and permissions. In fact, a similar job with the
>>> same paths works on Spark directly.
>>>     2.  Issues faced:
>>>        *   For input, Beam is not able to find the path. Console log:
>>> Filepattern adl://home/sample.txt matched 0 files with total size 0
>>>        *   The output path always gets converted to a relative path,
>>> something like this: /home/user1/adl:/home/output/.tmp....
>>>
>>> Debugging more into this, but was checking whether someone has already
>>> faced this and has some resolution.
>>>
>>> Here is a sample code and command I used.
>>>
>>>
>>>
>>>       HadoopFileSystemOptions options = PipelineOptionsFactory
>>>           .fromArgs(args).as(HadoopFileSystemOptions.class);
>>>
>>>       Pipeline p = Pipeline.create(options);
>>>
>>>       p.apply(TextIO.read().from(
>>>            options.getHdfsConfiguration().get(0).get("fs.defaultFS")))
>>>        .apply(new CountWords())
>>>        .apply(MapElements.via(new FormatAsTextFn()))
>>>        .apply(TextIO.write().to("adl://home/output"));
>>>
>>>       p.run().waitUntilFinish();
>>>
>>> spark-submit --class org.apache.beam.examples.WordCount --master local
>>> beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner
>>> --hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'
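If the escaped JSON on the command line turns out to be the fragile part, one alternative worth trying (a sketch I have not verified on HDI) is building the Hadoop Configuration programmatically and handing it to HadoopFileSystemOptions before creating the pipeline; the account and config values are placeholders:

```java
import java.util.Collections;

import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

public class AdlOptionsSketch {
    public static void main(String[] args) {
        // Build the filesystem configuration in code instead of via --hdfsConfiguration.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "adl://<account>.azuredatalakestore.net/");
        conf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem");
        conf.set("fs.AbstractFileSystem.adl.impl", "org.apache.hadoop.fs.adl.Adl");
        // ... plus the dfs.adls.oauth2.* credential keys ...

        HadoopFileSystemOptions options =
                PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
        // Register it explicitly so the adl:// scheme resolves at pipeline construction.
        options.setHdfsConfiguration(Collections.singletonList(conf));
        // ... then Pipeline.create(options) as in the sample code above.
    }
}
```

This sidesteps shell quoting entirely, so if the same failure persists, quoting can be ruled out as the cause.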
>>>
>>>
>>> P.S.: Created a fat jar to use with Spark just for testing. Is there any
>>> other, more correct way of running it with the Spark runner?
>>>
>>> -Milan.
>>>
>>>
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>

--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
