JavierLopezT opened a new issue, #5490:
URL: https://github.com/apache/hudi/issues/5490
Hello. I am facing an issue, and I am not even sure it's Hudi's fault,
but I am totally lost. Sorry in advance if it's not actually due to Hudi.
I have some code that reads a Hudi commit file (JSON), takes some info from it,
and then creates a DataFrame by reading some other Hudi files. The code is the
following:
```
def f(execute):
    table = "counters_raw"
    bucket = "s3://bucket-dev"
    s3_in_dev_storage_path = "IN-DEV/"
    commits_to_process = [
        f"{bucket}/{s3_in_dev_storage_path}counters_raw/.hoodie/20220503074605.commit",
        f"{bucket}/{s3_in_dev_storage_path}counters_raw/.hoodie/20220503074605.commit",
    ]
    for commit in commits_to_process:
        partitions = []
        # jsons = self.spark.read.format("json").load([commit], multiline=True)
        jsons = spark.read.option("multiline", "true").json(commit)
        write_partitions = jsons.select("writePartitionPaths").collect()
        for j in write_partitions:
            partitions += j["writePartitionPaths"]
        partitions = list(set(partitions))
        partition_list = [
            f"{bucket}/{s3_in_dev_storage_path}{table}/{x}"
            for x in partitions
        ]
        partition_paths = ",".join(partition_list)
        if execute:
            df_1 = (
                spark.read.format("hudi")
                .option("hoodie.datasource.read.paths", partition_paths)
                .load()
            )
    print("OK")
```
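For reference, the path-building part of the loop can be exercised without Spark at all. A minimal, Spark-free sketch (names borrowed from the snippet above; the `writePartitionPaths` values are invented for illustration):

```python
# Spark-free sketch of the path-building logic above.
# The writePartitionPaths values are invented for illustration.
table = "counters_raw"
bucket = "s3://bucket-dev"
s3_in_dev_storage_path = "IN-DEV/"

# Stand-in for jsons.select("writePartitionPaths").collect()
write_partitions = [
    ["year=2022/month=05", "year=2022/month=04"],
    ["year=2022/month=05"],
]

partitions = []
for j in write_partitions:
    partitions += j
partitions = sorted(set(partitions))  # dedupe; sorted for a stable order

partition_list = [
    f"{bucket}/{s3_in_dev_storage_path}{table}/{x}" for x in partitions
]
partition_paths = ",".join(partition_list)
print(partition_paths)
```

So the comma-joined string handed to `hoodie.datasource.read.paths` looks fine in isolation, which is what makes the failure confusing.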
If I run the function with `execute=False`, it works. If I run it with
`execute=True`, it fails with the following error:
```
An error was encountered:
Unable to infer schema for JSON. It must be specified manually.
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 372, in json
    return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Unable to infer schema for JSON. It must be specified manually.
```
**Environment Description**
* Hudi version :
* Spark version : 3.1.2
* Hive version : 3.1.2
* Hadoop version : Amazon 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on: EMR, both EMR Jupyter and EMR Steps
Any idea what is going on here? Thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]