JavierLopezT opened a new issue, #5490:
URL: https://github.com/apache/hudi/issues/5490

   Hello. I am facing an issue, and I am not even sure that it is Hudi's fault, but I am totally lost. Sorry if it is not in fact due to Hudi.
   
   I have code that reads a Hudi commit file (JSON), takes some info from it, and then creates a DataFrame by reading some other Hudi files. The code is the following:
   ```python
   def f(execute):
       table = "counters_raw"
       bucket = "s3://bucket-dev"
       s3_in_dev_storage_path = "IN-DEV/"

       commits_to_process = [
           f"{bucket}/{s3_in_dev_storage_path}counters_raw/.hoodie/20220503074605.commit",
           f"{bucket}/{s3_in_dev_storage_path}counters_raw/.hoodie/20220503074605.commit",
       ]
       for commit in commits_to_process:
           partitions = []
           # jsons = self.spark.read.format("json").load([commit], multiline=True)
           jsons = spark.read.option("multiline", "true").json(commit)
           write_partitions = jsons.select("writePartitionPaths").collect()
           for j in write_partitions:
               partitions += j["writePartitionPaths"]

           partitions = list(set(partitions))
           partition_list = [
               f"{bucket}/{s3_in_dev_storage_path}{table}/{x}"
               for x in partitions
           ]
           partition_paths = ",".join(partition_list)

           if execute:
               df_1 = (
                   spark.read.format("hudi")
                   .option("hoodie.datasource.read.paths", partition_paths)
                   .load()
               )

           print("OK")
   ```
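   For reference, the partition-path construction in the loop above can be reproduced outside Spark with only the standard library. The commit payload below is a made-up stand-in for the real `.commit` JSON (which carries much more metadata); it is kept minimal just to show the transformation:

```python
import json

# Toy stand-in for a Hudi .commit file; only writePartitionPaths matters here.
commit_text = """{
    "writePartitionPaths": ["dt=2022-05-01", "dt=2022-05-02", "dt=2022-05-01"]
}"""

bucket = "s3://bucket-dev"
s3_in_dev_storage_path = "IN-DEV/"
table = "counters_raw"

# Deduplicate the partition list, then build one comma-joined paths string,
# mirroring what the function does with the Spark-collected rows.
partitions = sorted(set(json.loads(commit_text)["writePartitionPaths"]))
partition_list = [f"{bucket}/{s3_in_dev_storage_path}{table}/{x}" for x in partitions]
partition_paths = ",".join(partition_list)
print(partition_paths)
```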
   
   If I run the function with `execute=False`, it works. If I run it with `execute=True`, it fails with the following error:
   ```
   An error was encountered:
   Unable to infer schema for JSON. It must be specified manually.
   Traceback (most recent call last):
     File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 372, in json
       return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
     File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
       answer, self.gateway_client, self.target_id, self.name)
     File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
       raise converted from None
   pyspark.sql.utils.AnalysisException: Unable to infer schema for JSON. It must be specified manually.
   ```
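   The message says the schema "must be specified manually"; one way to do that for the commit-file read is to pass a DDL-style schema string to the reader. This is only a sketch, assuming `writePartitionPaths` is an array of strings (which is how the loop above indexes it):

```python
# DDL-style schema for the single field the snippet reads from the commit
# file. Assumption: writePartitionPaths is an array of partition-path strings.
commit_schema = "writePartitionPaths array<string>"

# Hypothetical usage inside the loop (requires an active SparkSession `spark`):
# jsons = (spark.read.schema(commit_schema)
#          .option("multiline", "true")
#          .json(commit))
print(commit_schema)
```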
   
   
   **Environment Description**
   
   * Hudi version : 
   
   * Spark version : 3.1.2
   
   * Hive version : 3.1.2
   
   * Hadoop version : Amazon 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on: EMR, both EMR Jupyter and EMR Steps
   
   Any idea what is going on here? Thanks!
   

