[ https://issues.apache.org/jira/browse/SPARK-27623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16878896#comment-16878896 ]
Paul Wais commented on SPARK-27623:
-----------------------------------
I don't think this issue stems from a particular setup; rather, the published
spark-avro package itself seems to be broken.
See Stack Overflow:
* [https://stackoverflow.com/questions/55873023/built-in-spark-avro-unable-to-read-avro-file-from-shell]
* [https://stackoverflow.com/questions/53715347/spark-reading-avro-file]
This is also broken for me in Spark 2.4.3. Repro:
docker run --rm -it au2018/env:v1.5.1-draft bash
# Add org.apache.spark:spark-avro_2.12:2.4.3 :
$ vim /opt/spark/conf/spark-defaults.conf
# Run demo
$ python3
Python 3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import findspark
>>> findspark.init()
>>> import pyspark
>>> import pyspark.sql
>>> spark = pyspark.sql.SparkSession.builder.getOrCreate()
...
databricks#spark-deep-learning added as a dependency
databricks#tensorframes added as a dependency
org.apache.spark#spark-avro_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-4ba8057c-a59d-4005-ae5b-5ac5e3b9d91d;1.0
...
found org.spark-project.spark#unused;1.0.0 in central
downloading https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/2.4.3/spark-avro_2.12-2.4.3.jar
...
[SUCCESSFUL ] org.apache.spark#spark-avro_2.12;2.4.3!spark-avro_2.12.jar (107ms)
downloading https://repo1.maven.org/maven2/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar
...
[SUCCESSFUL ] org.spark-project.spark#unused;1.0.0!unused.jar (32ms)
...
>>> df = spark.read.format("avro").load("/opt/spark/examples/src/main/resources/users.avro")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/spark/python/pyspark/sql/readwriter.py", line 166, in load
return self._df(self._jreader.load(path))
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o28.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
at java.util.ServiceLoader.fail(ServiceLoader.java:232)
at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V
at org.apache.spark.sql.avro.AvroFileFormat.<init>(AvroFileFormat.scala:44)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
... 24 more
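For completeness, the same repro condensed into a small standalone PySpark script is below. This is only a sketch: it assumes the spark-avro coordinate is wired in through the spark.jars.packages property (the actual spark-defaults.conf edit above isn't shown, so that detail is an assumption), and it reuses the same users.avro sample path as the session above.
{code:python}
# Standalone repro sketch (assumption: spark-avro is pulled in via
# spark.jars.packages; the session above configured it in
# /opt/spark/conf/spark-defaults.conf instead).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Same coordinate as in the resolution log above (Scala 2.12 build).
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:2.4.3")
    .getOrCreate())

# Fails with the ServiceConfigurationError / NoSuchMethodError shown above.
df = spark.read.format("avro").load(
    "/opt/spark/examples/src/main/resources/users.avro")
df.show()
{code}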
> Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
> ---------------------------------------------------------------------------
>
> Key: SPARK-27623
> URL: https://issues.apache.org/jira/browse/SPARK-27623
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.2
> Reporter: Alexandru Barbulescu
> Priority: Major
>
> After updating to Spark 2.4.2, when using the
> {code:python}
> spark.read.format().options().load()
> {code}
>
> chain of methods, regardless of what parameter is passed to "format", we get
> the following error related to Avro:
>
> {code:java}
> .options(**load_options)
> File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 172, in load
> File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
> File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
> File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o69.load.
> : java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
> at java.util.ServiceLoader.fail(ServiceLoader.java:232)
> at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
> at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
> at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
> at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
> at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
> at scala.collection.Iterator.foreach(Iterator.scala:941)
> at scala.collection.Iterator.foreach$(Iterator.scala:941)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:250)
> at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:248)
> at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
> at scala.collection.TraversableLike.filter(TraversableLike.scala:262)
> at scala.collection.TraversableLike.filter$(TraversableLike.scala:262)
> at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
> at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/FileFormat$class
> at org.apache.spark.sql.avro.AvroFileFormat.<init>(AvroFileFormat.scala:44)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at java.lang.Class.newInstance(Class.java:442)
> at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
> ... 29 more
> Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.execution.datasources.FileFormat$class
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 36 more
> {code}
>
> The code we run looks like this:
>
> {code:python}
> from pyspark.sql import SparkSession
>
> spark_session = (
>     SparkSession.builder
>     .appName(APPLICATION_NAME)
>     .master(MASTER_URL)
>     .config('spark.cassandra.connection.host', SERVER_IP_ADDRESS)
>     .config('spark.cassandra.auth.username', CASSANDRA_USERNAME)
>     .config('spark.cassandra.auth.password', CASSANDRA_PASSWORD)
>     .config('spark.sql.shuffle.partitions', 16)
>     .config('parquet.enable.summary-metadata', 'true')
>     .getOrCreate())
>
> load_options = {
>     'keyspace': CASSANDRA_KEYSPACE,
>     'table': TABLE_NAME,
>     'spark.cassandra.input.fetch.size_in_rows': '150'}
>
> df = (spark_session.read.format('org.apache.spark.sql.cassandra')
>       .options(**load_options)
>       .load())
> {code}
>
> We get the exact same error when trying to read a local .avro file instead of
> reading from Cassandra.
> Until now we had been including the spark-avro .jar via the spark-submit
> --jars option. The spark-avro version we used, which worked with Spark 2.4.1,
> was 2.4.0.
> In an attempt to fix this problem we tried updating the .jar version, and we
> also tried the --packages option with different version combinations, but
> none of these worked; the same error shows up every time.
> When rolling back to Spark 2.4.1 with the exact same setup and code, the
> error doesn't show up and everything works fine.
> Any ideas on what could be causing this?
>
>