[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156581#comment-14156581 ]

Sean Owen commented on SPARK-3764:
----------------------------------

The artifacts themselves don't contain any Hadoop code. The pom's default 
dependency resolves to Hadoop 1, but apps are not meant to rely on that default 
(which is generally good Maven practice). Yes, you always need to add Hadoop 
dependencies explicitly if you use Hadoop APIs; that's not specific to Spark.

In fact, you will want to mark Spark and Hadoop as "provided" dependencies when 
making an app for use with spark-submit. You can use the Spark artifacts to 
build a Spark app that works with Hadoop 2 or Hadoop 1.
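For example, an application pom might declare dependencies along these lines (a 
minimal sketch; the Scala suffix, versions, and the choice of hadoop-client are 
illustrative and should match your cluster):

{code:xml}
<dependencies>
  <!-- Spark is supplied by the cluster / spark-submit at runtime, so mark it provided -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.1.0</version>
    <scope>provided</scope>
  </dependency>
  <!-- Add Hadoop explicitly when you use Hadoop APIs; pick the version your cluster runs -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.4.0</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
{code}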

The instructions you see are really about creating a build of Spark itself to 
deploy on a cluster, rather than an app for Spark.
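For reference, a build of Spark itself against a particular Hadoop version uses 
the Maven profiles described in the build docs, along the lines of the following 
(the profile and version here are only an example):

{code}
mvn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
{code}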

> Invalid dependencies of artifacts in Maven Central Repository.
> --------------------------------------------------------------
>
>                 Key: SPARK-3764
>                 URL: https://issues.apache.org/jira/browse/SPARK-3764
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 1.1.0
>            Reporter: Takuya Ueshin
>
> While testing my spark applications locally using spark artifacts downloaded 
> from Maven Central, the following exception was thrown:
> {quote}
> ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread 
> Thread[Executor task launch worker-2,5,main]
> java.lang.IncompatibleClassChangeError: Found class 
> org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
>       at 
> org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
>       at 
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
>       at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
>       at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
>       at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>       at org.apache.spark.scheduler.Task.run(Task.scala:54)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> {quote}
> This is because the hadoop class {{TaskAttemptContext}} is incompatible 
> between hadoop-1 and hadoop-2.
> I guess the spark artifacts in Maven Central were built against hadoop-2 with 
> Maven, but the hadoop version declared in {{pom.xml}} remains 1.0.4, so a 
> hadoop version mismatch happens.
> FYI:
> sbt seems to publish an 'effective pom'-like pom file, so the dependencies are 
> correctly resolved.


