zafer-sahin opened a new issue #2498:
URL: https://github.com/apache/hudi/issues/2498


   
   - Hudi is not able to read a MERGE_ON_READ table when using versions 0.6.0 or 0.7.0. When I run the same code with version 0.5.3, I am able to read the table that was written with the merge-on-read storage type.
   
   **Steps to reproduce the behavior:**
   
   **1.** Start a pyspark shell:
   **2.** `pyspark --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.0 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'`
   or
   `pyspark --packages org.apache.hudi:hudi-spark-bundle_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.4 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'`
   **3.** Write the table:
   ```
   >>> S3_SNAPSHOT = <snapshot location>
   >>> S3_MERGE_ON_READ = <location to replicate data>
   >>> from pyspark.sql.functions import *
   >>> df = spark.read.parquet(S3_SNAPSHOT)
   >>> df.count()
   21/01/27 14:49:13 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
   950897550
   >>> hudi_options_insert = {
   ...     "hoodie.table.name": "sample_schema.table_name",
   ...     "hoodie.datasource.write.storage.type": "MERGE_ON_READ",
   ...     "hoodie.datasource.write.recordkey.field": "id",
   ...     "hoodie.datasource.write.operation": "bulk_insert",
   ...     "hoodie.datasource.write.partitionpath.field": "ds",
   ...     "hoodie.datasource.write.precombine.field": "id",
   ...     "hoodie.insert.shuffle.parallelism": 135
   ...     }
   >>> df.write.format("hudi").options(**hudi_options_insert).mode("overwrite").save(S3_MERGE_ON_READ)
   ```
   
   
   **4.** Read the table back: `df_mor = spark.read.format("hudi").load(S3_MERGE_ON_READ + "/*")` — this fails with the stacktrace below.
   
   **Expected behavior**
   
   Data is loaded into the dataframe correctly when the spark shell is started with:
   `pyspark --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'`
   
   **Environment Description**
   Running on EMR.
   * Hudi version : 0.6.0 and 0.7.0 give the error; 0.5.3 runs without issue

   * Spark version : 2.4.4, 3.0.1
   
   * Hive version : 
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no 
   
   
   
   
   **Stacktrace**
   
   ```
           >>> df_mor = spark.read.format("hudi").load(S3_MERGE_ON_READ + "/*")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 178, in load
       return self._df(self._jreader.load(path))
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", 
line 1305, in __call__
     File "/usr/lib/spark/python/pyspark/sql/utils.py", line 128, in deco
       return f(*a, **kw)
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", 
line 328, in get_return_value
   py4j.protocol.Py4JJavaError: An error occurred while calling o86.load.
   : java.lang.NoSuchMethodError: 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(Lorg/apache/spark/sql/SparkSession;Lscala/collection/Seq;Lscala/collection/immutable/Map;Lscala/Option;Lorg/apache/spark/sql/execution/datasources/FileStatusCache;)V
        at 
org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)
        at 
org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)
        at 
org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)
        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)
        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
        at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
        at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
        at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
   ```
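   A `NoSuchMethodError` on the `InMemoryFileIndex` constructor typically points to a binary mismatch between the Hudi bundle and the Spark/Scala build actually running on the cluster. As a first sanity check, the `--packages` coordinates should at least agree on the Scala binary suffix (`_2.11` vs `_2.12`); a small sketch of that check (the helper names are illustrative, not part of Hudi or Spark):
   ```python
   # Sketch: verify that Maven coordinates passed to --packages all target
   # the same Scala binary version. Helper names are illustrative only.

   def scala_suffix(coordinate: str) -> str:
       """Return the Scala binary version from a Maven coordinate,
       e.g. 'org.apache.hudi:hudi-spark-bundle_2.12:0.7.0' -> '2.12'."""
       artifact = coordinate.split(":")[1]   # 'hudi-spark-bundle_2.12'
       return artifact.rsplit("_", 1)[1]     # '2.12'

   def same_scala_build(*coordinates: str) -> bool:
       """True when every coordinate targets the same Scala binary version."""
       return len({scala_suffix(c) for c in coordinates}) == 1

   # The 0.7.0 invocation from step 2: both artifacts are _2.12 builds.
   print(same_scala_build(
       "org.apache.hudi:hudi-spark-bundle_2.12:0.7.0",
       "org.apache.spark:spark-avro_2.12:3.0.0",
   ))  # True
   ```
   Note that a matching Scala suffix is necessary but not sufficient: the bundle must also be built against the same Spark line as the cluster, which is what the changed constructor signature in the stacktrace suggests.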
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

