[ 
https://issues.apache.org/jira/browse/SPARK-27623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16878896#comment-16878896
 ] 

Paul Wais commented on SPARK-27623:
-----------------------------------

I don't think this issue stems from a particular setup, but rather the public 
package seems broken.

See stack overflow: 

 * 
[https://stackoverflow.com/questions/55873023/built-in-spark-avro-unable-to-read-avro-file-from-shell]
 * [https://stackoverflow.com/questions/53715347/spark-reading-avro-file]


This is also broken for me in spark 2.4.3.  Repro:

docker run --rm -it au2018/env:v1.5.1-draft bash

# Add org.apache.spark:spark-avro_2.12:2.4.3 :

$ vim /opt/spark/conf/spark-defaults.conf

# Run demo
$ python3
Python 3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import findspark
>>> findspark.init()
>>> import pyspark
>>> import pyspark.sql
>>> spark = pyspark.sql.SparkSession.builder.getOrCreate()

...
databricks#spark-deep-learning added as a dependency
databricks#tensorframes added as a dependency
org.apache.spark#spark-avro_2.12 added as a dependency

:: resolving dependencies :: 
org.apache.spark#spark-submit-parent-4ba8057c-a59d-4005-ae5b-5ac5e3b9d91d;1.0
...
        found org.spark-project.spark#unused;1.0.0 in central
downloading 
https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/2.4.3/spark-avro_2.12-2.4.3.jar
 ...
        [SUCCESSFUL ] 
org.apache.spark#spark-avro_2.12;2.4.3!spark-avro_2.12.jar (107ms)
downloading 
https://repo1.maven.org/maven2/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar
 ...
        [SUCCESSFUL ] org.spark-project.spark#unused;1.0.0!unused.jar (32ms)
...
>>> df = 
>>> spark.read.format("avro").load("/opt/spark/examples/src/main/resources/users.avro")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/sql/readwriter.py", line 166, in load
    return self._df(self._jreader.load(path))
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
1257, in __call__
  File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, 
in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o28.load.
: java.util.ServiceConfigurationError: 
org.apache.spark.sql.sources.DataSourceRegister: Provider 
org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
        at java.util.ServiceLoader.fail(ServiceLoader.java:232)
        at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
        at 
java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
        at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
        at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
        at 
scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at 
scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
        at 
scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
        at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
        at 
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoSuchMethodError: 
org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V
        at 
org.apache.spark.sql.avro.AvroFileFormat.<init>(AvroFileFormat.scala:44)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at java.lang.Class.newInstance(Class.java:442)
        at 
java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
        ... 24 more





 

 

> Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-27623
>                 URL: https://issues.apache.org/jira/browse/SPARK-27623
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.2
>            Reporter: Alexandru Barbulescu
>            Priority: Major
>
> After updating to spark 2.4.2 when using the 
> {code:java}
> spark.read.format().options().load()
> {code}
>  
> chain of methods, regardless of what parameter is passed to "format" we get 
> the following error related to avro:
>  
> {code:java}
> - .options(**load_options)
> - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 
> 172, in load
> - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
> 1257, in __call__
> - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in 
> deco
> - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> - py4j.protocol.Py4JJavaError: An error occurred while calling o69.load.
> - : java.util.ServiceConfigurationError: 
> org.apache.spark.sql.sources.DataSourceRegister: Provider 
> org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
> - at java.util.ServiceLoader.fail(ServiceLoader.java:232)
> - at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
> - at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
> - at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
> - at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
> - at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
> - at scala.collection.Iterator.foreach(Iterator.scala:941)
> - at scala.collection.Iterator.foreach$(Iterator.scala:941)
> - at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> - at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> - at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> - at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> - at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:250)
> - at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:248)
> - at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
> - at scala.collection.TraversableLike.filter(TraversableLike.scala:262)
> - at scala.collection.TraversableLike.filter$(TraversableLike.scala:262)
> - at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
> - at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
> - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
> - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
> - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> - at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> - at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> - at java.lang.reflect.Method.invoke(Method.java:498)
> - at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> - at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> - at py4j.Gateway.invoke(Gateway.java:282)
> - at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> - at py4j.commands.CallCommand.execute(CallCommand.java:79)
> - at py4j.GatewayConnection.run(GatewayConnection.java:238)
> - at java.lang.Thread.run(Thread.java:748)
> - Caused by: java.lang.NoClassDefFoundError: 
> org/apache/spark/sql/execution/datasources/FileFormat$class
> - at org.apache.spark.sql.avro.AvroFileFormat.<init>(AvroFileFormat.scala:44)
> - at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> - at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> - at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> - at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> - at java.lang.Class.newInstance(Class.java:442)
> - at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
> - ... 29 more
> - Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.sql.execution.datasources.FileFormat$class
> - at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> - at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> - at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> - at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> - ... 36 more
> {code}
>  
> The code we run looks like this:
>  
> {code:java}
> spark_session = (
>  SparkSession.builder
>  .appName(APPLICATION_NAME)
>  .master(MASTER_URL)
>  .config('spark.cassandra.connection.host', SERVER_IP_ADDRESS)
>  .config('spark.cassandra.auth.username', CASSANDRA_USERNAME)
>  .config('spark.cassandra.auth.password', CASSANDRA_PASSWORD)
>  .config('spark.sql.shuffle.partitions', 16)
>  .config('parquet.enable.summary-metadata', 'true')
>  .getOrCreate())
>  load_options = {
>  'keyspace': CASSANDRA_KEYSPACE,
>  'table': TABLE_NAME,
>  'spark.cassandra.input.fetch.size_in_rows': '150' }
>  df = (spark_session.read.format('org.apache.spark.sql.cassandra')
>  .options(**load_options)
>  .load())
> {code}
>  
> We get the exact same error when trying to read a local .avro file instead of 
> from Cassandra.
> Up to now we included the .jar file for Spark-Avro using the spark-submit 
> --jars option. The version of Spark-Avro that we used, and worked with Spark 
> 2.4.1, was Spark-Avro 2.4.0.
> In an attempt to fix this problem we tried updating the .jar file version. We 
> also tried using the --packages option, with different version combinations, 
> but none of these solutions worked. The same error shows up every time. 
> When rolling back to Spark 2.4.1 with the exact same setup and code, the 
> error doesn't show up and everything works fine. 
> Any ideas on what could be causing this?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to