Andrew Kerr created TOREE-349:

             Summary: ClassCastException when reading Avro from another thread 
(Toree master / Spark 2.0.0)
                 Key: TOREE-349
             Project: TOREE
          Issue Type: Bug
            Reporter: Andrew Kerr

When using Toree (master branch, commit e8ecd0623c65ad104045b1797fb27f69b8dfc23f)
with `--packages=com.databricks:spark-avro_2.11:3.0.1` in `SPARK_OPTS`,
attempting to load an Avro file into a DataFrame *in a separate thread*
throws a `ClassCastException`:
`com.databricks.spark.avro.DefaultSource$SerializableConfiguration cannot be 
cast to com.databricks.spark.avro.DefaultSource$SerializableConfiguration`

I will attach a Jupyter notebook that illustrates the problem and includes the
full stack trace, along with a script showing the environment.

The class that throws the exception, `DefaultSource`, broadcasts the Hadoop
config and returns an anonymous function that accesses that config. The
exception occurs when that function is executed and attempts to read the config.

This looks like a class loader mismatch problem to me ("Class Identity Crisis").
With a bit of hacking of `spark-avro` I've seen the class loader for 
`DefaultSource` when the config is broadcast to be 
and when the config is read to be
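
The suspected mismatch can be reproduced outside Spark. The following is a
minimal Java sketch under stated assumptions, not the `spark-avro` code:
`ClassIdentityDemo`, `Config` and `IsolatingLoader` are hypothetical names, with
`Config` standing in for `DefaultSource$SerializableConfiguration`. It defines
the same class bytes through a second class loader and shows that the JVM treats
the result as a different type, so the cast fails even though both names match.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class ClassIdentityDemo {

    /** Hypothetical stand-in for DefaultSource$SerializableConfiguration. */
    public static class Config {}

    /** Defines its own copy of Config from the class bytes; all other classes go to the parent. */
    static final class IsolatingLoader extends ClassLoader {
        IsolatingLoader(ClassLoader parent) {
            super(parent);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals(Config.class.getName())) {
                return super.loadClass(name, resolve); // normal parent-first delegation
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> cls = findLoadedClass(name);
                if (cls == null) {
                    byte[] bytes = readClassBytes(name);
                    cls = defineClass(name, bytes, 0, bytes.length);
                }
                if (resolve) {
                    resolveClass(cls);
                }
                return cls;
            }
        }

        /** Reads the compiled class file through the parent loader's resources. */
        private byte[] readClassBytes(String name) throws ClassNotFoundException {
            String resource = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(resource)) {
                if (in == null) {
                    throw new ClassNotFoundException(name);
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                for (int n; (n = in.read(buf)) != -1; ) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Returns true if an identically named Config from another loader cannot be cast to Config. */
    static boolean castFailsAcrossLoaders() {
        ClassLoader appLoader = ClassIdentityDemo.class.getClassLoader();
        try {
            Object foreign = new IsolatingLoader(appLoader)
                    .loadClass(Config.class.getName())
                    .getDeclaredConstructor()
                    .newInstance();
            boolean sameName = foreign.getClass().getName().equals(Config.class.getName());
            try {
                Config local = (Config) foreign; // checkcast against the application loader's Config
                return local == null;            // reached only if the cast succeeded
            } catch (ClassCastException e) {
                System.out.println(e.getMessage());
                return sameName;
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("cast fails across loaders: " + castFailsAcrossLoaders());
    }
}
```

If the same thing happens in Toree, the broadcast side and the reading side
would each hold their own copy of `SerializableConfiguration`, which would match
the message reported above.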

If a fat jar including `spark-avro` is built and included with `--jars=...`
then the same problem occurs.

Interestingly, Spark's built-in CSV support uses the same pattern as Avro
(broadcasting a config) but works as expected, as shown in the notebook.

Avro also works as expected when an application fat jar is built and passed to 
`spark-submit` without involving Toree.

Therefore this problem appears to require:

* Toree
* Broadcast variable
* Library added using `--packages` or `--jars`
* Library accessed from a thread other than the Toree interpreter's
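
To check the thread-related condition, it may help to record which class loaders
are visible on each side. The helper below is a hypothetical diagnostic sketch
(`LoaderProbe` and its methods are not part of Toree, Spark or `spark-avro`): it
lists the loader chain of a class and reports the context class loader seen by a
newly started thread, so the values can be compared with those on the
interpreter thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

final class LoaderProbe {

    /** Returns the loader chain of cls, from its defining loader up to the bootstrap loader. */
    static List<String> loaderChain(Class<?> cls) {
        List<String> chain = new ArrayList<>();
        for (ClassLoader cl = cls.getClassLoader(); cl != null; cl = cl.getParent()) {
            chain.add(cl.getClass().getName() + "@" + Integer.toHexString(System.identityHashCode(cl)));
        }
        chain.add("<bootstrap>"); // a null loader means the bootstrap loader
        return chain;
    }

    /** Starts a fresh thread and returns the context class loader that thread sees. */
    static ClassLoader contextLoaderOnNewThread() {
        AtomicReference<ClassLoader> seen = new AtomicReference<>();
        Thread t = new Thread(() -> seen.set(Thread.currentThread().getContextClassLoader()));
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }
}
```

In a plain JVM a new thread inherits its creator's context class loader, so the
two values match. A difference between the loader chains recorded at broadcast
time and at read time would support the class loader explanation.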
