[ https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156581#comment-14156581 ]
Sean Owen commented on SPARK-3764:
----------------------------------

The artifacts themselves don't contain any Hadoop code. The pom defaults to linking against Hadoop 1, but apps are not meant to depend on that default (this is generally good Maven practice). Yes, you always need to add Hadoop dependencies if you use Hadoop APIs; that's not specific to Spark. In fact, you will want to mark Spark and Hadoop as "provided" dependencies when building an app for use with spark-submit. You can use the Spark artifacts to build a Spark app that works with either Hadoop 1 or Hadoop 2. The instructions you see are really about creating a build of Spark itself to deploy on a cluster, rather than an app for Spark.

> Invalid dependencies of artifacts in Maven Central Repository.
> --------------------------------------------------------------
>
>                 Key: SPARK-3764
>                 URL: https://issues.apache.org/jira/browse/SPARK-3764
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 1.1.0
>            Reporter: Takuya Ueshin
>
> While testing my Spark applications locally using Spark artifacts downloaded
> from Maven Central, the following exception was thrown:
> {quote}
> ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main]
> java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
> 	at org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
> 	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
> 	at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
> 	at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
> 	at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:54)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {quote}
> This is because the Hadoop class {{TaskAttemptContext}} is incompatible between hadoop-1 and hadoop-2 (it is a class in hadoop-1 but an interface in hadoop-2).
> I guess the Spark artifacts in Maven Central were built against hadoop-2 with Maven, but the Hadoop version declared in {{pom.xml}} remains 1.0.4, so a Hadoop version mismatch happens.
> FYI:
> sbt seems to publish an 'effective pom'-like pom file, so the dependencies are correctly resolved.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
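A minimal sketch of the app-side pom fragment Sean describes, with Spark and Hadoop marked "provided" so spark-submit supplies them from the cluster's classpath. The artifact and version numbers here (spark-core_2.10 1.1.0, hadoop-client 2.4.0) are illustrative assumptions; match them to your own cluster:

{code:xml}
<!-- Application pom fragment (illustrative versions).
     "provided" keeps Spark and Hadoop out of the app's runtime
     classpath, avoiding the hadoop-1 vs hadoop-2 class clash. -->
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.1.0</version>
    <scope>provided</scope>
  </dependency>
  <!-- Declare Hadoop explicitly only if the app uses Hadoop APIs
       directly; pick the version your cluster actually runs. -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.4.0</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
{code}

With this setup the app compiles against whichever Hadoop version is declared, but at runtime it uses the jars already present on the cluster, so the {{TaskAttemptContext}} class-vs-interface mismatch quoted above cannot arise from the app's own packaging.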