[
https://issues.apache.org/jira/browse/SPARK-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240201#comment-14240201
]
Clay Kim commented on SPARK-3368:
---------------------------------
I got this to work by including avro-mapred with the hadoop2 classifier:
"org.apache.avro" % "avro-mapred" % "1.7.6" classifier "hadoop2", since Hive
depends on the hadoop2 variant of avro-mapred.
Also, I used Spark built against Hadoop 1: spark-1.1.1-bin-hadoop1.
I've put a working example here:
https://github.com/theclaymethod/spark-parquet-thrift-example
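For reference, a minimal build.sbt sketch of the dependency setup described above; only the avro-mapred line is from the comment, the other coordinates and versions are illustrative assumptions:
{noformat}
// build.sbt -- sketch only; versions other than avro-mapred are assumptions
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "1.1.1" % "provided",
  // the hadoop2 classifier matches the avro-mapred variant that Hive pulls in
  "org.apache.avro"  %  "avro-mapred"  % "1.7.6" classifier "hadoop2",
  "com.twitter"      %  "parquet-avro" % "1.6.0"
)
{noformat}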
> Spark cannot be used with Avro and Parquet
> ------------------------------------------
>
> Key: SPARK-3368
> URL: https://issues.apache.org/jira/browse/SPARK-3368
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.2
> Reporter: Graham Dennis
>
> Spark cannot currently (as of 1.0.2) use any Parquet write support classes
> that are not part of the spark assembly jar (at least when launched using
> `spark-submit`). This prevents using Avro with Parquet.
> See https://github.com/GrahamDennis/spark-avro-parquet for a test case to
> reproduce this issue.
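> A job that hits this code path looks roughly like the sketch below (the object
> name, schema, and output path are illustrative assumptions; the linked
> repository contains the actual reproduction):
> {noformat}
> // Sketch of a reproduction; names, schema, and paths are assumptions.
> import org.apache.avro.Schema
> import org.apache.avro.generic.{GenericData, GenericRecord}
> import org.apache.hadoop.mapreduce.Job
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.SparkContext._
> import parquet.avro.{AvroParquetOutputFormat, AvroWriteSupport}
> import parquet.hadoop.ParquetOutputFormat
>
> object AvroParquetRepro {
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext(new SparkConf().setAppName("avro-parquet-repro"))
>     val schema = new Schema.Parser().parse(
>       """{"type":"record","name":"Sample","fields":[{"name":"id","type":"int"}]}""")
>     val job = new Job(sc.hadoopConfiguration)
>     // Sets parquet.write.support.class=parquet.avro.AvroWriteSupport in the job
>     // conf; the executor later fails to load that class (see stack trace below).
>     ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
>     AvroParquetOutputFormat.setSchema(job, schema)
>     val records = sc.parallelize(1 to 10).map { i =>
>       val r: GenericRecord = new GenericData.Record(schema)
>       r.put("id", i)
>       (null: Void, r)
>     }
>     records.saveAsNewAPIHadoopFile("/tmp/sample.parquet", classOf[Void],
>       classOf[GenericRecord], classOf[ParquetOutputFormat[GenericRecord]],
>       job.getConfiguration)
>     sc.stop()
>   }
> }
> {noformat}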
> The problem appears in the master logs as:
> {noformat}
> 14/09/03 17:31:10 ERROR Executor: Exception in task ID 0
> parquet.hadoop.BadConfigurationException: could not instanciate class
> parquet.avro.AvroWriteSupport set in job conf at parquet.write.support.class
> at
> parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:121)
> at
> parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:302)
> at
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
> at
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:714)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:699)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> at org.apache.spark.scheduler.Task.run(Task.scala:51)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: parquet.avro.AvroWriteSupport
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:190)
> at
> parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:115)
> ... 11 more
> {noformat}
> The root cause of the problem is that the class loader used to find the
> Parquet write support class only searches the Spark assembly jar and does not
> also search the application jar. A solution would be to ensure that the
> application jar is always available on the executor classpath. This is the
> same underlying issue as SPARK-2878 and SPARK-3166.
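> Until that is fixed, one possible workaround is sketched below (unverified;
> the jar file name and path are assumptions): distribute the application jar
> and also place it on the executor JVM classpath so the system class loader
> that Parquet uses can see it.
> {noformat}
> // Unverified workaround sketch; the jar name/path below is an assumption.
> import org.apache.spark.{SparkConf, SparkContext}
>
> val conf = new SparkConf()
>   .setAppName("avro-parquet-example")
>   // distribute the application jar to executors (same effect as spark-submit --jars)
>   .setJars(Seq("target/scala-2.10/avro-parquet-example-assembly-0.1.jar"))
>   // add the distributed jar to the executor JVM classpath, not just the task class loader
>   .set("spark.executor.extraClassPath", "avro-parquet-example-assembly-0.1.jar")
> val sc = new SparkContext(conf)
> {noformat}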