[ https://issues.apache.org/jira/browse/SPARK-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240201#comment-14240201 ]

Clay Kim commented on SPARK-3368:
---------------------------------

I got this to work by including avro-mapred with the hadoop2 classifier: 
"org.apache.avro" % "avro-mapred" % "1.7.6" classifier "hadoop2", since Hive 
includes a dependency on avro-mapred built for hadoop2. 
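
Roughly, the relevant sbt settings look like this (a sketch only; the 
spark-core coordinates, versions, and the "provided" scope are assumptions, 
adjust to your own build):

{noformat}
libraryDependencies ++= Seq(
  // Spark itself is supplied by the cluster / spark-submit
  "org.apache.spark" %% "spark-core"  % "1.1.1" % "provided",
  // hadoop2 classifier, matching the avro-mapred variant that Hive depends on
  "org.apache.avro"  %  "avro-mapred" % "1.7.6" classifier "hadoop2"
)
{noformat}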

Also, I used Spark built against Hadoop 1: spark-1.1.1-bin-hadoop1.

I've included an example here: 
https://github.com/theclaymethod/spark-parquet-thrift-example

> Spark cannot be used with Avro and Parquet
> ------------------------------------------
>
>                 Key: SPARK-3368
>                 URL: https://issues.apache.org/jira/browse/SPARK-3368
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.2
>            Reporter: Graham Dennis
>
> Spark cannot currently (as of 1.0.2) use any Parquet write support classes 
> that are not part of the Spark assembly jar (at least when launched using 
> `spark-submit`).  This prevents using Avro with Parquet.
> See https://github.com/GrahamDennis/spark-avro-parquet for a test case to 
> reproduce this issue.
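> In outline, the write path that triggers this looks roughly like the 
> following (a sketch, not the exact code from that repository; the 
> Avro-generated class User, the input collection, and the output path are 
> hypothetical):
> {noformat}
>     import org.apache.hadoop.mapreduce.Job
>     import parquet.avro.{AvroParquetOutputFormat, AvroWriteSupport}
>     import parquet.hadoop.ParquetOutputFormat
>
>     // sc: SparkContext, users: Seq[User] (User is a hypothetical Avro class)
>     // Tell ParquetOutputFormat to use the Avro write support class -- this is
>     // the class the executor later fails to load.
>     val job = new Job(sc.hadoopConfiguration)
>     ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
>     AvroParquetOutputFormat.setSchema(job, User.getClassSchema)
>
>     sc.parallelize(users)
>       .map(user => (null: Void, user))
>       .saveAsNewAPIHadoopFile(
>         "hdfs:///tmp/users.parquet",
>         classOf[Void], classOf[User],
>         classOf[AvroParquetOutputFormat],
>         job.getConfiguration)
> {noformat}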
> The problem appears in the master logs as:
> {noformat}
>     14/09/03 17:31:10 ERROR Executor: Exception in task ID 0
>     parquet.hadoop.BadConfigurationException: could not instanciate class 
> parquet.avro.AvroWriteSupport set in job conf at parquet.write.support.class
>       at 
> parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:121)
>       at 
> parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:302)
>       at 
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
>       at 
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
>       at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:714)
>       at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:699)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>       at org.apache.spark.scheduler.Task.run(Task.scala:51)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
>     Caused by: java.lang.ClassNotFoundException: parquet.avro.AvroWriteSupport
>       at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>       at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>       at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>       at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>       at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>       at java.lang.Class.forName0(Native Method)
>       at java.lang.Class.forName(Class.java:190)
>       at 
> parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:115)
>       ... 11 more
> {noformat}
> The root cause of the problem is that the class loader that's used to find 
> the Parquet write support class only searches the Spark assembly jar and 
> doesn't also search the application jar.  A solution would be to ensure that 
> the application jar is always available on the executor classpath.  This is 
> the same underlying issue as SPARK-2878 and SPARK-3166.
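> A manual stopgap along those lines (sketched here with a hypothetical path 
> and app name; the jar must already be present at that path on every worker) 
> is to set spark.executor.extraClassPath when building the SparkConf:
> {noformat}
>     import org.apache.spark.{SparkConf, SparkContext}
>
>     // point each executor JVM at a local copy of the application jar
>     // (hypothetical path; it must exist on every worker node)
>     val conf = new SparkConf()
>       .setAppName("spark-avro-parquet")
>       .set("spark.executor.extraClassPath", "/opt/jobs/spark-avro-parquet.jar")
>     val sc = new SparkContext(conf)
> {noformat}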



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
