[ 
https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523741#comment-16523741
 ] 

Steve Loughran edited comment on HADOOP-15559 at 6/26/18 1:51 PM:
------------------------------------------------------------------

# We feel your pain. Getting everything synced up is hard, especially when 
Spark itself bumps up some dependencies incompatibly (SPARK-22919).
 # The latest docs on this topic [are 
here|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md];
 as they say:

{quote}Critical: Do not attempt to "drop in" a newer version of the AWS SDK 
than that which the Hadoop version was built with. Whatever problem you have, 
changing the AWS SDK version will not fix things, only change the stack traces 
you see.
{quote}
{quote}Similarly, don't try and mix a hadoop-aws JAR from one Hadoop release 
with that of any other. The JAR must be in sync with hadoop-common and some 
other Hadoop JARs.
{quote}
{quote}Randomly changing hadoop- and aws- JARs in the hope of making a problem 
"go away" or to gain access to a feature you want, will not lead to the outcome 
you desire.
{quote}

We also point people at 
[mvnrepo|http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws] for the 
normative mapping between hadoop-aws releases and the AWS SDK JARs they need. 
Getting a different AWS SDK to work with a given Hadoop AWS binding is not 
easy, and if you do try it, you are on your own, with nothing but the JIRAs 
related to "upgrade AWS SDK" to act as a cue.
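
To illustrate (a rough sketch only; the exact artifact names and versions 
should be checked against that mvnrepository page before relying on them):

{code:bash}
# Illustrative hadoop-aws -> AWS SDK pairings; verify on mvnrepository:
#   org.apache.hadoop:hadoop-aws:2.7.x   -> com.amazonaws:aws-java-sdk:1.7.4
#   org.apache.hadoop:hadoop-aws:2.8.x   -> com.amazonaws:aws-java-sdk-s3 (+core/kms) 1.10.x
#   org.apache.hadoop:hadoop-aws:2.9.x+  -> com.amazonaws:aws-java-sdk-bundle (shaded)

# One way to check locally: pull the hadoop-aws POM into the local repo and
# read its compile-scope AWS dependencies (the SDK version itself may live in
# the hadoop-project parent POM's dependencyManagement section).
mvn dependency:get -Dartifact=org.apache.hadoop:hadoop-aws:2.8.4
{code}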

Where life gets hard is that unless you build Spark with the -Phadoop-cloud 
profile, you don't get everything lined up for you; that is exactly what the 
profile is for.
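
If you do build Spark yourself, something along these lines (a sketch; the 
exact profiles to enable depend on the Spark branch and its build docs) 
produces a distribution with the cloud connectors and their matched 
dependencies included:

{code:bash}
# Build a Spark distribution with the hadoop-cloud module wired in.
# Profile names are illustrative; check the Spark build docs for your branch.
./dev/make-distribution.sh --name with-cloud --tgz \
  -Phadoop-2.7 -Phadoop-cloud -Phive
{code}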

Regarding your specific issue: unless you are using a release of Spark built 
with that cloud profile, you have to do it by hand (a concrete sketch follows 
this list).
* Get the hadoop-aws JAR whose version exactly matches hadoop-common; the same 
goes for hadoop-auth and the other Hadoop JARs. You cannot mix them.
* Get the matching AWS SDK JAR(s), using mvnrepo as your guide.
* And Jackson, obviously. FWIW, Hadoop 2.9+ has moved to the shaded AWS SDK 
JAR to avoid a lot of this pain.
* If you want to use Hadoop 2.8, downgrade the httpclient libraries unless 
your Spark distribution has reverted SPARK-22919 (see the PR there for what 
changed).
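
As a concrete sketch for a stock Spark download built against Hadoop 2.7 (the 
versions here are illustrative; match them to whatever Hadoop line your 
distribution actually ships):

{code:bash}
# Spark built against Hadoop 2.7.x: use hadoop-aws from the same 2.7.x line.
# --packages pulls the matching aws-java-sdk (and its dependencies) in
# transitively via Ivy.
pyspark --packages "org.apache.hadoop:hadoop-aws:2.7.6"

# Do NOT mix Hadoop lines against the same Spark build; that is what produces
# errors like the IllegalAccessError quoted in this issue, e.g.
# hadoop-aws:2.8.4 on a Hadoop 2.7-based Spark.
{code}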

Returning to your complaint: anything else you can do for the docs is welcome, 
though really, the best strategy would be to get Spark releases built with 
that hadoop-cloud profile, which is intended to give you all the dependencies 
you need, and none of the ones you don't.


> Clarity on Spark compatibility with hadoop-aws
> ----------------------------------------------
>
>                 Key: HADOOP-15559
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15559
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: documentation, fs/s3
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a 
> command-line tool for launching Apache Spark clusters on AWS. One of the 
> things I try to do for my users is make it straightforward to use Spark with 
> {{s3a://}}. I do this by recommending that users start Spark with the 
> {{hadoop-aws}} package.
> For example:
> {code:java}
> pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
> {code}
> I'm struggling, however, to understand what versions of {{hadoop-aws}} should 
> work with what versions of Spark.
> Spark releases are [built against Hadoop 
> 2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time, 
> I've been told that I should be able to use newer versions of Hadoop and 
> Hadoop libraries with Spark, so for example, running Spark built against 
> Hadoop 2.7 alongside HDFS 2.8 should work, and there is [no need to build 
> Spark explicitly against Hadoop 
> 2.8|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-3-1-RC4-tp24087p24092.html].
> I'm having trouble translating this mental model into recommendations for how 
> to pair Spark with {{hadoop-aws}}.
> For example, Spark 2.3.1 built against Hadoop 2.7 works with 
> {{hadoop-aws:2.7.6}} but not with {{hadoop-aws:2.8.4}}. Trying the latter 
> yields the following error when I try to access files via {{s3a://}}.
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
> : java.lang.IllegalAccessError: tried to access method 
> org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V
>  from class org.apache.hadoop.fs.s3a.S3AInstrumentation
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
> at 
> org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
> at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
> at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748){code}
> So it would seem that {{hadoop-aws}} must be matched to the same MAJOR.MINOR 
> release of Hadoop that Spark is built against. However, neither [this 
> page|https://wiki.apache.org/hadoop/AmazonS3] nor [this 
> one|https://hortonworks.github.io/hdp-aws/s3-spark/] shed any light on how to 
> pair the correct version of {{hadoop-aws}} with Spark.
> Would it be appropriate to add some guidance somewhere on what versions of 
> {{hadoop-aws}} work with what versions and builds of Spark? It would help 
> eliminate this kind of guesswork and slow spelunking.


