[jira] [Commented] (TOREE-349) ClassCastException when reading Avro from another thread (Toree master / Spark 2.0.0)
[ https://issues.apache.org/jira/browse/TOREE-349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15664355#comment-15664355 ]

Phil Berkland commented on TOREE-349:
-------------------------------------

Please look at the solution to https://issues.apache.org/jira/browse/TOREE-351, which is probably the same underlying issue.

> ClassCastException when reading Avro from another thread (Toree master / Spark 2.0.0)
> -------------------------------------------------------------------------------------
>
>                 Key: TOREE-349
>                 URL: https://issues.apache.org/jira/browse/TOREE-349
>             Project: TOREE
>          Issue Type: Bug
>            Reporter: Andrew Kerr
>         Attachments: avro-csv-addDeps.scala.ipynb, avro-csv-threading.scala.ipynb, avro-csv-threading.scala.ipynb, run.sh
>
>
> When using Toree (master branch, commit e8ecd0623c65ad104045b1797fb27f69b8dfc23f)
> with `--packages=com.databricks:spark-avro_2.11:3.0.1` in `SPARK_OPTS`,
> attempting to load an Avro file into a DataFrame *in a separate thread* throws
> `java.lang.ClassCastException: com.databricks.spark.avro.DefaultSource$SerializableConfiguration cannot be cast to com.databricks.spark.avro.DefaultSource$SerializableConfiguration`
> here:
> https://github.com/databricks/spark-avro/blob/v3.0.1/src/main/scala/com/databricks/spark/avro/DefaultSource.scala#L156
> I will attach a Jupyter notebook that illustrates the problem and includes the full
> stack trace, along with a script showing the environment.
> The class that throws the exception, `DefaultSource`, broadcasts the Hadoop config
> and returns an anonymous function that accesses that config. The exception
> occurs when that function is executed and attempts to access the config.
> This looks like a class loader mismatch problem to me ("Class Identity
> Crisis"): the JVM treats two classes with the same name as different types
> when they are defined by different class loaders.
> With a bit of hacking of `spark-avro`, I've seen the class loader for
> `DefaultSource` be
> `scala.reflect.internal.util.ScalaClassLoader$URLClassLoader@31ac5411`
> when the config is broadcast, and
> `org.apache.spark.util.MutableURLClassLoader@3d3fcdb0`
> when the config is read.
> If a fat jar including `spark-avro` is built and included with `--jars=...`,
> the same problem occurs.
> Interestingly, Spark's built-in CSV support uses the same pattern as
> Avro (broadcasting a config) but works as expected, as shown in the notebook.
> https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L108
> Avro also works as expected when an application fat jar is built and passed
> to `spark-submit` without involving Toree.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
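The "Class Identity Crisis" named in the description can be reproduced without Spark or Toree. The following is a minimal sketch under stated assumptions, not the code path inside `spark-avro`: `Payload` and `IsolatingLoader` are hypothetical names invented for the illustration, and the program assumes it is compiled normally so that `Payload.class` is reachable as a resource from the application class loader.

```scala
import java.io.ByteArrayOutputStream

// Hypothetical stand-in for DefaultSource$SerializableConfiguration.
class Payload

// Hypothetical loader that defines `target` itself instead of delegating to
// its parent, so each loader instance yields its own Class object for that name.
class IsolatingLoader(target: String, parent: ClassLoader) extends ClassLoader(parent) {
  override def loadClass(name: String, resolve: Boolean): Class[_] =
    if (name != target) super.loadClass(name, resolve)
    else {
      val loaded = findLoadedClass(name)
      if (loaded != null) loaded else defineFromParentBytes(name)
    }

  // Reads the compiled class bytes through the parent and defines them here.
  private def defineFromParentBytes(name: String): Class[_] = {
    val in = parent.getResourceAsStream(name.replace('.', '/') + ".class")
    if (in == null) throw new ClassNotFoundException(name)
    try {
      val out = new ByteArrayOutputStream()
      val buf = new Array[Byte](4096)
      var n = in.read(buf)
      while (n != -1) { out.write(buf, 0, n); n = in.read(buf) }
      defineClass(name, out.toByteArray, 0, out.size())
    } finally in.close()
  }
}

object ClassIdentityDemo {
  def main(args: Array[String]): Unit = {
    val name = classOf[Payload].getName
    val appLoader = classOf[Payload].getClassLoader
    val a = new IsolatingLoader(name, appLoader).loadClass(name)
    val b = new IsolatingLoader(name, appLoader).loadClass(name)

    println(s"same name:  ${a.getName == b.getName}")
    println(s"same class: ${a == b}")

    // An instance created from `a` is not an instance of `b`, so the cast fails
    // with a message that names the same class on both sides.
    val instance = a.getDeclaredConstructor().newInstance().asInstanceOf[AnyRef]
    try b.cast(instance)
    catch { case e: ClassCastException => println(s"cast failed: ${e.getMessage}") }
  }
}
```

The two `Class` objects share a fully qualified name but have different defining loaders, which is the same shape as the "X cannot be cast to X" message in the report.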
[jira] [Commented] (TOREE-349) ClassCastException when reading Avro from another thread (Toree master / Spark 2.0.0)
[ https://issues.apache.org/jira/browse/TOREE-349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15585210#comment-15585210 ]

Andrew Kerr commented on TOREE-349:
-----------------------------------

Apparently not, no. The code below works fine in spark-shell.

{code:language=scala}
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SaveMode
import com.databricks.spark.avro._

val session = SparkSession.builder().getOrCreate()
import session.implicits._

// `sc` is the SparkContext that spark-shell predefines.
val dataframe = sc.parallelize(1 to 10).toDF
dataframe.show()

// Write the same data once as CSV and once as Avro.
dataframe.write.mode(SaveMode.Overwrite).csv("csv")
dataframe.write.mode(SaveMode.Overwrite).avro("avro")

// Read CSV from another thread.
val csvFuture = Future(session.read.csv("csv"))
val csvResult = Await.result(csvFuture, Duration.Inf)
csvResult.show()

// Read Avro from another thread.
val avroFuture = Future(session.read.avro("avro"))
val avroResult = Await.result(avroFuture, Duration.Inf)
avroResult.show()
{code}
[jira] [Commented] (TOREE-349) ClassCastException when reading Avro from another thread (Toree master / Spark 2.0.0)
[ https://issues.apache.org/jira/browse/TOREE-349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583801#comment-15583801 ]

Marius Van Niekerk commented on TOREE-349:
------------------------------------------

So do you need to do the same thing for spark-shell?
[jira] [Commented] (TOREE-349) ClassCastException when reading Avro from another thread (Toree master / Spark 2.0.0)
[ https://issues.apache.org/jira/browse/TOREE-349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15582769#comment-15582769 ]

Andrew Kerr commented on TOREE-349:
-----------------------------------

This code works as expected:

```
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import com.databricks.spark.avro._

// `session` is assumed to be the SparkSession defined earlier in the notebook.
// Capture the notebook thread's context class loader.
val classLoader = Thread.currentThread().getContextClassLoader
println(classLoader)

val future = Future {
  // Install the captured loader on the worker thread before reading Avro.
  Thread.currentThread().setContextClassLoader(classLoader)
  session.read.avro("foo")
}
val result = Await.result(future, Duration.Inf)
result.show()
```

The printed class loader is `scala.tools.nsc.interpreter.IMain$TranslatingClassLoader@864ff30`.

Obviously this isn't ideal. It also isn't necessary for loading CSV files, whose loader is implemented in a similar way to the Avro loader (the Avro code looks copy-pasted from the CSV code).
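The workaround in this comment can be wrapped in a small helper so that the loader is captured once and restored afterwards. This is only a sketch using the Scala standard library: `futureWithCallerLoader` is a hypothetical name, not part of Toree or Spark, and whether restoring the worker thread's previous loader matters depends on how the thread pool is shared.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ContextLoaderPropagation {
  // Hypothetical helper: runs `body` on `ec` with the caller's context class
  // loader installed, then restores the worker thread's previous loader so
  // that pooled threads are left as they were found.
  def futureWithCallerLoader[T](body: => T)(implicit ec: ExecutionContext): Future[T] = {
    val callerLoader = Thread.currentThread().getContextClassLoader
    Future {
      val worker = Thread.currentThread()
      val previous = worker.getContextClassLoader
      worker.setContextClassLoader(callerLoader)
      try body
      finally worker.setContextClassLoader(previous)
    }
  }

  def main(args: Array[String]): Unit = {
    import ExecutionContext.Implicits.global

    // Install a recognisable loader on the calling thread for the demo.
    val marker = new ClassLoader(Thread.currentThread().getContextClassLoader) {}
    Thread.currentThread().setContextClassLoader(marker)

    // The by-name body is evaluated on the worker thread.
    val seen = Await.result(
      futureWithCallerLoader(Thread.currentThread().getContextClassLoader),
      Duration.Inf)
    println(s"worker saw caller's loader: ${seen eq marker}")
  }
}
```

In the notebook, the Avro read would then become something like `futureWithCallerLoader(session.read.avro("foo"))`, assuming `session` and the `spark-avro` import are already in scope.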
[jira] [Commented] (TOREE-349) ClassCastException when reading Avro from another thread (Toree master / Spark 2.0.0)
[ https://issues.apache.org/jira/browse/TOREE-349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575398#comment-15575398 ]

Marius Van Niekerk commented on TOREE-349:
------------------------------------------

So for your separate thread, do you set the contextClassLoader explicitly? You might have to do that to ensure that you use the proper classloader (the org.apache.spark one).
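One way to check whether the separate thread really sees a different context class loader is to print the loader chain on both threads. The snippet below is a diagnostic sketch using only the Scala standard library; `LoaderChain` is a hypothetical name, and the output depends entirely on the environment it runs in (a Toree notebook, spark-shell, or a plain JVM).

```scala
import scala.annotation.tailrec
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object LoaderChain {
  // Lists a loader followed by its parents. The JVM represents the bootstrap
  // loader as null, which is rendered here as "<bootstrap>".
  def chain(loader: ClassLoader): List[String] = {
    @tailrec
    def go(cl: ClassLoader, acc: List[String]): List[String] =
      if (cl == null) ("<bootstrap>" :: acc).reverse
      else go(cl.getParent, cl.toString :: acc)
    go(loader, Nil)
  }

  def main(args: Array[String]): Unit = {
    // Chain seen by the calling thread.
    val caller = chain(Thread.currentThread().getContextClassLoader)
    println("caller: " + caller.mkString(" -> "))

    // Chain seen by a thread from the global execution context.
    val worker = Await.result(
      Future(chain(Thread.currentThread().getContextClassLoader)),
      Duration.Inf)
    println("worker: " + worker.mkString(" -> "))
  }
}
```

If the two printed chains differ, setting the context class loader explicitly on the worker thread, as suggested here, is the thing to try.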