jtmzheng commented on issue #2566:
URL: https://github.com/apache/hudi/issues/2566#issuecomment-778854746
```dockerfile
# NB: We use this base image for leveraging Docker support on EMR 6.x
FROM amazoncorretto:8
RUN yum -y update
RUN yum -y install yum-utils
RUN yum -y groupinstall development
RUN yum -y install python3 python3-devel python3-pip python3-virtualenv
RUN yum -y install lzo-devel lzo
ENV PYSPARK_DRIVER_PYTHON python3
ENV PYSPARK_PYTHON python3
RUN ln -sf /usr/bin/python3 /usr/bin/python && \
    ln -sf /usr/bin/pip3 /usr/bin/pip
RUN pip install pyspark==3.0.0
RUN pip install pytest==6.1.1
COPY ./test_hudi.py .
# RUN py.test -s --verbose test_hudi.py
```
I was able to reproduce this issue with the above Dockerfile, building it with
`docker build -f hudi.Dockerfile -t test_hudi .` and then running `py.test -s
--verbose test_hudi.py` inside the container. I have edited `test_hudi.py` to the
version below (mainly to add `spark-sql_2.12:3.0.0`), but it hits exactly the same
issue as the initial test:
```python
import pytest
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Row


def test_hudi(tmp_path):
    SparkContext.getOrCreate(
        conf=SparkConf()
        .setAppName("testing")
        .setMaster("local[6]")
        .set(
            "spark.jars.packages",
            "org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.0,org.apache.spark:spark-sql_2.12:3.0.0",
        )
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.sql.hive.convertMetastoreParquet", "false")
    )
    spark = SparkSession.builder.getOrCreate()
    hudi_options = {
        "hoodie.table.name": "test",
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
        "hoodie.datasource.write.partitionpath.field": "year,month,day",
        "hoodie.datasource.write.table.name": "test",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.precombine.field": "ts",
    }
    df = spark.createDataFrame(
        [
            Row(id=1, year=2020, month=7, day=5, ts=1),
        ]
    )
    df.write.format("hudi").options(**hudi_options).mode("append").save(str(tmp_path))

    read_df = spark.read.format("parquet").load(str(tmp_path) + "/*/*/*")
    # This works
    print(read_df.collect())

    read_df = spark.read.format("hudi").load(str(tmp_path) + "/*/*/*")
    # This does not
    print(read_df.collect())
```
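For context, the `/*/*/*` glob in both `load` calls mirrors the three-level `year,month,day` partitioning configured via `hoodie.datasource.write.partitionpath.field` (one wildcard per partition column). A small sketch of that relationship — the helper name is my own, not part of Hudi:

```python
def partition_glob(base_path: str, partition_field: str) -> str:
    """Build a read glob with one wildcard level per partition column.

    `partition_field` is a comma-separated list, as in the
    `hoodie.datasource.write.partitionpath.field` option above.
    """
    depth = len(partition_field.split(","))
    return base_path + "/*" * depth


# For the options used in the test:
print(partition_glob("/tmp/test", "year,month,day"))  # /tmp/test/*/*/*
```

So if the partition columns change, the glob passed to `load` has to change with them, which is one more thing to check when the `parquet` read succeeds but the `hudi` read does not.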
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]