jtmzheng opened a new issue #2878:
URL: https://github.com/apache/hudi/issues/2878


   **Describe the problem you faced**
   
   This is the same issue as https://github.com/apache/hudi/issues/2566#issuecomment-821583643, reported against 0.8; it seems the latest version still does not fix it (unless I'm doing something wrong here).
   
   **To Reproduce**
   
   I was able to reproduce this issue with the Dockerfile below, building with `docker build -f hudi.Dockerfile -t test_hudi .` and running `py.test -s --verbose test_hudi.py` in the container.
   
   Steps to reproduce the behavior:
   
   Dockerfile:
    ```dockerfile
   # NB: We use this base image for leveraging Docker support on EMR 6.x
   FROM amazoncorretto:8
   
   RUN yum -y update
   RUN yum -y install yum-utils
   RUN yum -y groupinstall development
   
    RUN yum -y install python3 python3-devel python3-pip python3-virtualenv
   RUN yum -y install lzo-devel lzo
   
   ENV PYSPARK_DRIVER_PYTHON python3
   ENV PYSPARK_PYTHON python3
   
    RUN ln -sf /usr/bin/python3 /usr/bin/python && \
        ln -sf /usr/bin/pip3 /usr/bin/pip
   
   RUN pip install pyspark==3.0.0
   RUN pip install pytest==6.1.1
   COPY ./test_hudi.py .
   
   # RUN py.test -s --verbose test_hudi.py
   ```
   
   test_hudi.py
    ```python
   import pytest
   from pyspark import SparkConf
   from pyspark import SparkContext
   from pyspark.sql import SparkSession
   
   from pyspark.sql import Row
   
   
   def test_hudi(tmp_path):
       SparkContext.getOrCreate(
           conf=SparkConf()
           .setAppName("testing")
           .setMaster("local[6]")
            .set(
                "spark.jars.packages",
                "org.apache.hudi:hudi-spark-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.0,org.apache.spark:spark-sql_2.12:3.0.0",
            )
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .set("spark.sql.hive.convertMetastoreParquet", "false")
        )
       spark = SparkSession.builder.getOrCreate()
   
       hudi_options = {
           "hoodie.table.name": "test",
           "hoodie.datasource.write.recordkey.field": "id",
            "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
           "hoodie.datasource.write.partitionpath.field": "year,month,day",
           "hoodie.datasource.write.table.name": "test",
           "hoodie.datasource.write.table.type": "MERGE_ON_READ",
           "hoodie.datasource.write.operation": "upsert",
           "hoodie.datasource.write.precombine.field": "ts",
       }
       df = spark.createDataFrame(
           [
               Row(id=1, year=2020, month=7, day=5, ts=1),
           ]
        )
        df.write.format("hudi").options(**hudi_options).mode("append").save(str(tmp_path))
       read_df = spark.read.format("parquet").load(str(tmp_path) + "/*/*/*")
       # This works
       print(read_df.collect())
   
        read_df = spark.read.format("hudi").load(str(tmp_path) + "/*/*/*")
        # This does not work (fails with the stacktrace referenced below)
        print(read_df.collect())
   ```
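   For what it's worth, the workaround discussed in the linked issue is to match the glob depth to the table layout: one `*` per partition directory level plus one more for the data files (so four levels here, not three), or on newer releases to load the base path directly. A minimal sketch of that glob construction, with a hypothetical helper name:
   
   ```python
def hudi_read_path(base_path: str, num_partition_fields: int) -> str:
    """Build the glob for reading a partitioned Hudi table via Spark.

    One "*" per partition directory level (year/month/day above),
    plus one more level for the data files themselves.
    """
    return base_path + "/*" * (num_partition_fields + 1)


# With the three partition fields above (year, month, day), the read
# would be spark.read.format("hudi").load(hudi_read_path(path, 3)),
# i.e. path + "/*/*/*/*" rather than path + "/*/*/*".
print(hudi_read_path("/tmp/test", 3))  # → /tmp/test/*/*/*/*
   ```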
   
   **Expected behavior**
   The test above passes, i.e. the `hudi`-format read succeeds as well as the `parquet` one. See https://issues.apache.org/jira/browse/HUDI-1568
   
   
   **Additional context**
   See https://issues.apache.org/jira/browse/HUDI-1568
   
   **Stacktrace**
   Same as https://github.com/apache/hudi/issues/2566#issue-805978132
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
