In your example it seems your HDFS configuration doesn't contain any ADL-specific settings:

    --hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'

Do you have a core-site.xml or hdfs-site.xml configured as per
https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html ?
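For reference, here is a minimal core-site.xml sketch of the ADL pieces that appear to be missing. Every value below is a placeholder (tenant ID, client ID, secret), not something taken from this thread; the key names come from the hadoop-azure-datalake page linked above, which remains the authoritative source:

```xml
<configuration>
  <!-- Service-to-service OAuth2 authentication for the adl:// filesystem -->
  <property>
    <name>fs.adl.oauth2.access.token.provider.type</name>
    <value>ClientCredential</value>
  </property>
  <property>
    <name>fs.adl.oauth2.refresh.url</name>
    <!-- Placeholder: your Azure AD tenant's token endpoint -->
    <value>https://login.microsoftonline.com/YOUR_TENANT_ID/oauth2/token</value>
  </property>
  <property>
    <name>fs.adl.oauth2.client.id</name>
    <!-- Placeholder: your service principal's application ID -->
    <value>YOUR_CLIENT_ID</value>
  </property>
  <property>
    <name>fs.adl.oauth2.credential</name>
    <!-- Placeholder: your service principal's secret -->
    <value>YOUR_CLIENT_SECRET</value>
  </property>
</configuration>
```

The same keys can alternatively be passed inline through --hdfsConfiguration, as one JSON map per filesystem.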
From the documentation for --hdfsConfiguration:

    A list of Hadoop configurations used to configure zero or more Hadoop
    filesystems. By default, Hadoop configuration is loaded from
    'core-site.xml' and 'hdfs-site.xml' based upon the HADOOP_CONF_DIR and
    YARN_CONF_DIR environment variables. To specify configuration on the
    command-line, represent the value as a JSON list of JSON maps, where
    each map represents the entire configuration for a single Hadoop
    filesystem. For example:
    --hdfsConfiguration='[{\"fs.default.name\": \"hdfs://localhost:9998\", ...},{\"fs.default.name\": \"s3a://\", ...},...]'

From:
https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45

On Wed, Nov 22, 2017 at 1:12 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi,
>
> FYI, I'm in touch with the Microsoft Azure team about that.
>
> We are testing the ADLS support via HDFS.
>
> I'll keep you posted.
>
> Regards
> JB
>
> On 11/22/2017 09:12 AM, Milan Chandna wrote:
>>
>> Hi,
>>
>> Has anyone tried IO from (or to) an ADLS account on Beam with the Spark
>> runner? I tried this recently but was unable to make it work.
>>
>> Steps that I tried:
>>
>> 1. Took an HDI + Spark 1.6 cluster with an ADLS account as the default
>>    storage.
>> 2. Built Apache Beam on it, including the BEAM-2790
>>    <https://issues.apache.org/jira/browse/BEAM-2790> fix, which I had
>>    previously been hitting for ADL as well.
>> 3. Modified the WordCount.java example to use HadoopFileSystemOptions.
>> 4. Since the HDI + Spark cluster has ADLS as the default FS, I tried two
>>    things:
>>    * Just gave the input and output paths as adl://home/sample.txt and
>>      adl://home/output.
>>    * In addition to the adl input and output paths, also gave the
>>      required HDFS configuration, including the ADL-specific settings.
>>
>> Neither worked, by the way.
>>
>> 1. I have checked ACLs and permissions. In fact, a similar job with the
>>    same paths works on Spark directly.
>> 2. Issues faced:
>>    * For input, Beam is not able to find the path. Console log:
>>      Filepattern adl://home/sample.txt matched 0 files with total size 0
>>    * The output path always gets converted to a relative path, something
>>      like this: /home/user1/adl:/home/output/.tmp....
>>
>> I am debugging this further, but wanted to check whether someone is
>> already facing this and has a resolution.
>>
>> Here is a sample of the code and command I used:
>>
>> HadoopFileSystemOptions options =
>>     PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
>>
>> Pipeline p = Pipeline.create(options);
>>
>> p.apply(TextIO.read().from(
>>         options.getHdfsConfiguration().get(0).get("fs.defaultFS")))
>>  .apply(new CountWords())
>>  .apply(MapElements.via(new FormatAsTextFn()))
>>  .apply(TextIO.write().to("adl://home/output"));
>>
>> p.run().waitUntilFinish();
>>
>> spark-submit --class org.apache.beam.examples.WordCount --master local \
>>     beam-examples-java-2.3.0-SNAPSHOT.jar --runner=SparkRunner \
>>     --hdfsConfiguration='[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]'
>>
>> P.S.: I created a fat jar to use with Spark just for testing. Is there
>> any other correct way of running it with the Spark runner?
>>
>> -Milan.
>>

> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
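Two separate problems are visible in the spark-submit flag quoted above: the JSON value is malformed (the inner map opened with `{` is never closed with `}` before the `]`), and fs.defaultFS points at a file rather than a filesystem root. A throwaway balance check would have caught the first problem before the job was ever submitted. This is a plain-JDK sketch of mine (the class name HdfsConfCheck is hypothetical, not anything from Beam):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class HdfsConfCheck {

    // Returns true if braces/brackets in the JSON text are balanced,
    // ignoring anything inside double-quoted strings.
    static boolean balanced(String s) {
        Deque<Character> stack = new ArrayDeque<>();
        boolean inString = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inString) {
                if (c == '\\') i++;               // skip the escaped character
                else if (c == '"') inString = false;
                continue;
            }
            if (c == '"') inString = true;
            else if (c == '{' || c == '[') stack.push(c);
            else if (c == '}') { if (stack.isEmpty() || stack.pop() != '{') return false; }
            else if (c == ']') { if (stack.isEmpty() || stack.pop() != '[') return false; }
        }
        return !inString && stack.isEmpty();
    }

    public static void main(String[] args) {
        // The flag value as posted in the thread: '{' is never closed.
        String original  = "[{\"fs.defaultFS\": \"hdfs://home/sample.txt\"]";
        // Same value with the closing brace added.
        String corrected = "[{\"fs.defaultFS\": \"adl://home\"}]";
        System.out.println(balanced(original));   // false
        System.out.println(balanced(corrected));  // true
    }
}
```

With the closing brace added and an adl:// root, the flag would take the shape --hdfsConfiguration='[{\"fs.defaultFS\": \"adl://home\"}]', with the ADL OAuth2 keys included in the same JSON map if they are not already in core-site.xml.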