jtmzheng commented on issue #2566:
URL: https://github.com/apache/hudi/issues/2566#issuecomment-778854746
```dockerfile
# NB: We use this base image for leveraging Docker support on EMR 6.x
FROM amazoncorretto:8
RUN yum -y update
RUN yum -y install yum-utils
RUN yum -y groupinstall development
RUN yum -y install python3 python3-devel python3-pip python3-virtualenv
RUN yum -y install lzo-devel lzo
ENV PYSPARK_DRIVER_PYTHON python3
ENV PYSPARK_PYTHON python3
RUN ln -sf /usr/bin/python3 /usr/bin/python && \
    ln -sf /usr/bin/pip3 /usr/bin/pip
RUN pip install pyspark==3.0.0
RUN pip install pytest==6.1.1
COPY ./test_hudi.py .
# RUN py.test -s --verbose test_hudi.py
```
I was able to reproduce this issue with the above Dockerfile, building it with
`docker build -f hudi.Dockerfile -t test_hudi .` and then running `py.test -s
--verbose test_hudi.py` inside the container. I have edited `test_hudi.py` to the
version below (mainly to add `spark-sql_2.12:3.0.0`), but it hits exactly the same
issue as the initial test:
```python
import pytest
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Row


def test_hudi(tmp_path):
    SparkContext.getOrCreate(
        conf=SparkConf()
        .setAppName("testing")
        .setMaster("local[6]")
        .set(
            "spark.jars.packages",
            "org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.0,org.apache.spark:spark-sql_2.12:3.0.0",
        )
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.sql.hive.convertMetastoreParquet", "false")
    )
    spark = SparkSession.builder.getOrCreate()
    hudi_options = {
        "hoodie.table.name": "test",
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
        "hoodie.datasource.write.partitionpath.field": "year,month,day",
        "hoodie.datasource.write.table.name": "test",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.precombine.field": "ts",
    }
    df = spark.createDataFrame(
        [
            Row(id=1, year=2020, month=7, day=5, ts=1),
        ]
    )
    df.write.format("hudi").options(**hudi_options).mode("append").save(str(tmp_path))

    read_df = spark.read.format("parquet").load(str(tmp_path) + "/*/*/*")
    # This works
    print(read_df.collect())

    read_df = spark.read.format("hudi").load(str(tmp_path) + "/*/*/*")
    # This does not
    print(read_df.collect())
```
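For context, the `/*/*/*` glob in both `load` calls mirrors the three-level `year,month,day` partitioning configured via `hoodie.datasource.write.partitionpath.field` (one wildcard per partition column). A small sketch of that relationship — the helper name is my own, not part of Hudi:

```python
def partition_glob(base_path: str, partition_field: str) -> str:
    """Build a read glob with one wildcard level per partition column.

    `partition_field` is a comma-separated list, as in the
    `hoodie.datasource.write.partitionpath.field` option above.
    """
    depth = len(partition_field.split(","))
    return base_path + "/*" * depth


# For the options used in the test:
print(partition_glob("/tmp/test", "year,month,day"))  # /tmp/test/*/*/*
```

So if the partition columns change, the glob passed to `load` has to change with them, which is one more thing to check when the `parquet` read succeeds but the `hudi` read does not.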
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]