Nicholas Chammas created HADOOP-15559:
-----------------------------------------
Summary: Clarity on Spark compatibility with hadoop-aws
Key: HADOOP-15559
URL: https://issues.apache.org/jira/browse/HADOOP-15559
Project: Hadoop Common
Issue Type: Improvement
Components: documentation, fs/s3
Reporter: Nicholas Chammas
I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a
command-line tool for launching Apache Spark clusters on AWS. One of the things
I try to do for my users is make it straightforward to use Spark with
{{s3a://}}. I do this by recommending that users start Spark with the
{{hadoop-aws}} package.
For example:
{code:bash}
pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
{code}
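For users who prefer configuration over flags, setting {{spark.jars.packages}} should be equivalent on recent Spark 2.x. A minimal sketch in PySpark, with the caveat that the property must be set before the SparkSession (and its JVM) is created:

{code:python}
# Sketch: should be equivalent to passing --packages on the command line.
# spark.jars.packages must be set before the SparkSession is created;
# it has no effect on an already-running session.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.8.4")
    .getOrCreate()
)
{code}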
I'm struggling, however, to understand what versions of {{hadoop-aws}} should
work with what versions of Spark.
Spark releases are [built against Hadoop
2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time, I've
been told that newer versions of Hadoop and of the Hadoop libraries should work
with Spark: for example, Spark built against Hadoop 2.7 should run fine
alongside HDFS 2.8, and there is [no need to build Spark explicitly against
Hadoop
2.8|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-3-1-RC4-tp24087p24092.html].
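For what it's worth, one way to check which Hadoop version a running Spark
session is actually using is to reach into the JVM via Py4J. A quick diagnostic
sketch (it relies on the private {{sc._jvm}} attribute, so it's a hack, not a
supported API):

{code:python}
# Diagnostic sketch: ask the Hadoop client libraries on Spark's classpath
# for their version. Run inside a pyspark shell, where `sc` (the
# SparkContext) is already defined. Note: sc._jvm is a PySpark internal,
# not a public API.
print(sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
# e.g. prints "2.7.3" for a stock Spark 2.3.1 build against Hadoop 2.7
{code}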
I'm having trouble translating this mental model into recommendations for how
to pair Spark with {{hadoop-aws}}.
For example, Spark 2.3.1 built against Hadoop 2.7 works with
{{hadoop-aws:2.7.6}} but not with {{hadoop-aws:2.8.4}}. Using the latter yields
the following error as soon as I try to access files via {{s3a://}}:
{code:java}
py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
: java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
	at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
	at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
	at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
	at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
{code}
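For the record, the pairing that does work for me with Spark 2.3.1 built
against Hadoop 2.7 is the matching 2.7.x line:

{code:bash}
pyspark --packages "org.apache.hadoop:hadoop-aws:2.7.6"
{code}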
So it would seem that {{hadoop-aws}} must be matched to the same MAJOR.MINOR
release of Hadoop that Spark is built against. However, neither [this
page|https://wiki.apache.org/hadoop/AmazonS3] nor [this
one|https://hortonworks.github.io/hdp-aws/s3-spark/] sheds any light on how to
pair the correct version of {{hadoop-aws}} with Spark.
Would it be appropriate to add some guidance somewhere on what versions of
{{hadoop-aws}} work with what versions and builds of Spark? It would help
eliminate this kind of guesswork and slow spelunking.
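To make the rule I'm inferring concrete, here is a hypothetical sketch of the
pairing check; neither the function nor the rule is official, and I'd love to
see it confirmed (or corrected) in the docs:

{code:python}
# Hypothetical helper illustrating the pairing rule inferred above.
# Assumption: hadoop-aws works iff it shares a MAJOR.MINOR line with
# the Hadoop version Spark was built against. Not an official rule.
def compatible(spark_hadoop_version, hadoop_aws_version):
    """True if the two versions share the same MAJOR.MINOR line."""
    return spark_hadoop_version.split(".")[:2] == hadoop_aws_version.split(".")[:2]

assert compatible("2.7.3", "2.7.6")        # works in my testing
assert not compatible("2.7.3", "2.8.4")    # fails with the IllegalAccessError above
{code}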