Andrew Kerr created TOREE-349:
---------------------------------

             Summary: ClassCastException when reading Avro from another thread (Toree master / Spark 2.0.0)
                 Key: TOREE-349
                 URL: https://issues.apache.org/jira/browse/TOREE-349
             Project: TOREE
          Issue Type: Bug
            Reporter: Andrew Kerr

When using Toree (master branch, commit e8ecd0623c65ad104045b1797fb27f69b8dfc23f) with `--packages=com.databricks:spark-avro_2.11:3.0.1` in `SPARK_OPTS`, attempting to load an Avro file into a DataFrame *in a separate thread* throws

`java.lang.ClassCastException: com.databricks.spark.avro.DefaultSource$SerializableConfiguration cannot be cast to com.databricks.spark.avro.DefaultSource$SerializableConfiguration`

here:

https://github.com/databricks/spark-avro/blob/v3.0.1/src/main/scala/com/databricks/spark/avro/DefaultSource.scala#L156

I will attach a Jupyter notebook that illustrates the problem and includes the full stack trace, along with a script showing the environment.

The class that throws the exception, `DefaultSource`, broadcasts the Hadoop config and returns an anonymous function that accesses that config. The exception occurs when that function is executed and attempts to access the config.

This looks like a class loader mismatch problem to me ("Class Identity Crisis"): the source and target of the cast have the same class name, which is what the JVM reports when one class has been loaded by two different class loaders. With a bit of hacking of `spark-avro` I've seen the class loader for `DefaultSource` to be `scala.reflect.internal.util.ScalaClassLoader$URLClassLoader@31ac5411` when the config is broadcast and `org.apache.spark.util.MutableURLClassLoader@3d3fcdb0` when the config is read.

The same problem occurs if a fat jar including `spark-avro` is built and included with `--jars=...`.

Interestingly, Spark's built-in CSV support uses the same pattern as Avro, broadcasting a config, but works as expected, as shown in the notebook:

https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L108

Avro also works as expected when an application fat jar is built and passed to `spark-submit` without involving Toree.
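
The suspected mechanism can be shown in isolation on any JVM. The following is a minimal sketch, not code from Toree, Spark, or `spark-avro`: the names `ClassIdentityDemo` and `Payload` are hypothetical, and the two `URLClassLoader` instances only stand in for the two loaders observed above. It assumes a JDK (Java 11 or later) so that the system compiler is available at runtime.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class ClassIdentityDemo {

    // Compile a trivial, hypothetical class named Payload into a fresh temp directory.
    static Path compilePayload() throws Exception {
        Path dir = Files.createTempDirectory("payload");
        Path src = dir.resolve("Payload.java");
        Files.writeString(src, "public class Payload {}");
        // getSystemJavaCompiler() returns null when no compiler is available (JRE only).
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        if (javac == null
                || javac.run(null, null, null, "-d", dir.toString(), src.toString()) != 0) {
            throw new IllegalStateException("could not compile Payload");
        }
        return dir;
    }

    // Load Payload through two unrelated loaders and try to cast across them.
    static String demo() {
        try {
            URL[] classpath = { compilePayload().toUri().toURL() };
            // A null parent means neither loader delegates to the other,
            // so each one defines its own copy of Payload.
            try (URLClassLoader writer = new URLClassLoader(classpath, null);
                 URLClassLoader reader = new URLClassLoader(classpath, null)) {
                Class<?> written = writer.loadClass("Payload");
                Class<?> expected = reader.loadClass("Payload");
                Object value = written.getDeclaredConstructor().newInstance();
                boolean castFailed;
                try {
                    expected.cast(value);
                    castFailed = false;
                } catch (ClassCastException e) {
                    castFailed = true;
                }
                return String.format("sameName=%b sameClass=%b castFailed=%b",
                        written.getName().equals(expected.getName()),
                        written == expected,
                        castFailed);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If the report's diagnosis is right, the value broadcast under one loader and read under the other would be in the same position as `value` here: an instance whose class has the expected name but a different defining loader.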

Therefore this problem appears to require:

* Toree
* A broadcast variable
* A library added using `--packages` or `--jars`
* The library being accessed from a thread other than the Toree interpreter's

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)