zafer-sahin opened a new issue #2498:
URL: https://github.com/apache/hudi/issues/2498
- Hudi is not able to read a MERGE_ON_READ table when using versions
[0.6.0] and [0.7.0]. When I run the same code with version [0.5.3], I am
able to read the table generated with the merge-on-read option.
**Steps to reproduce the behavior:**
**1.** Start a pyspark shell:
**2.** `pyspark --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.0 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'`
or
`pyspark --packages org.apache.hudi:hudi-spark-bundle_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.4 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'`
**3.**
```
>>> S3_SNAPSHOT = <snapshot location>
>>> S3_MERGE_ON_READ = <location to replicate data>
>>> from pyspark.sql.functions import *
>>> df = spark.read.parquet(S3_SNAPSHOT)
>>> df.count()
21/01/27 14:49:13 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
950897550
>>> hudi_options_insert = {
...     "hoodie.table.name": "sample_schema.table_name",
...     "hoodie.datasource.write.storage.type": "MERGE_ON_READ",
...     "hoodie.datasource.write.recordkey.field": "id",
...     "hoodie.datasource.write.operation": "bulk_insert",
...     "hoodie.datasource.write.partitionpath.field": "ds",
...     "hoodie.datasource.write.precombine.field": "id",
...     "hoodie.insert.shuffle.parallelism": 135
... }
>>> df.write.format("hudi").options(**hudi_options_insert).mode("overwrite").save(S3_MERGE_ON_READ)
```
**4.** Attempt to read the table back: `df_mor = spark.read.format("hudi").load(S3_MERGE_ON_READ + "/*")` (fails with the stacktrace below).
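One thing worth double-checking when reproducing step 2 (this note is not part of the original report): every Spark-related artifact passed via `--packages` carries a Scala suffix (e.g. `_2.12`), and all of them must agree with the Scala version of the running Spark build (Spark 3.x uses Scala 2.12; Spark 2.4 defaults to 2.11). A purely illustrative helper to confirm that all coordinates in a `--packages` string share one suffix:

```python
# Illustrative only -- not from the original report.
def scala_suffixes(packages: str) -> list:
    """Return the Scala suffix of each coordinate in a --packages string."""
    suffixes = []
    for coord in packages.split(","):
        artifact = coord.split(":")[1]  # groupId:artifactId:version
        if "_" in artifact:
            suffixes.append(artifact.rsplit("_", 1)[1])
    return suffixes

pkgs = ("org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,"
        "org.apache.spark:spark-avro_2.12:3.0.0")
# Both artifacts use _2.12, so the set of suffixes collapses to one entry.
assert len(set(scala_suffixes(pkgs))) == 1
```

A mismatched suffix (or a bundle built against a different Spark line) is a common source of `NoSuchMethodError` at load time, so ruling it out first narrows the report to Hudi itself.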
**Expected behavior**
Data is loaded into the dataframe without issue when the spark shell is started with:
`pyspark --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'`
**Environment Description**
EMR
* Hudi version : [0.7.0] and [0.6.0] give the error; [0.5.3] runs without
issue
* Spark version : [2.4.4], [3.0.1]
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
**Stacktrace**
```
>>> df_mor = spark.read.format("hudi").load(S3_MERGE_ON_READ + "/*")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 178, in load
    return self._df(self._jreader.load(path))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 128, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o86.load.
: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(Lorg/apache/spark/sql/SparkSession;Lscala/collection/Seq;Lscala/collection/immutable/Map;Lscala/Option;Lorg/apache/spark/sql/execution/datasources/FileStatusCache;)V
	at org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)
	at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)
	at org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]