[jira] [Created] (HUDI-4046) spark.read.load API

Istvan Darvas (Jira) Thu, 05 May 2022 04:44:05 -0700

Istvan Darvas created HUDI-4046:
-----------------------------------

             Summary: spark.read.load API
                 Key: HUDI-4046
                 URL: https://issues.apache.org/jira/browse/HUDI-4046
             Project: Apache Hudi
          Issue Type: Bug
    Affects Versions: 0.10.1
            Reporter: Istvan Darvas



Hi Guys!

I would like to control the number of partitions that HUDI reads.

 
from typing import List

base_path: str  # root path of the HUDI table
partition_paths: List[str] = ["prefix/part1", "prefix/part2", "prefix/part3"]
ingress_pkg_arrived = (spark.read
                       .format('org.apache.hudi')
                       .option("basePath", base_path)
                       # comma-separated list of the partition paths to read
                       .option("hoodie.datasource.read.paths", ",".join(partition_paths))
                       .load(partition_paths))
 
This works only if I explicitly set "hoodie.datasource.read.paths", so I have
to generate a comma-separated list for that parameter myself.
If I do not set it, I get a HUDI exception which tells me I need to set it.
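
For reference, the workaround above can be wrapped in a small helper. This is
only a sketch: hudi_read_options is a hypothetical name, not part of HUDI or
Spark. It just builds the two options used in the snippet above from the same
list of partition paths.

```python
from typing import Dict, List


def hudi_read_options(base_path: str, partition_paths: List[str]) -> Dict[str, str]:
    """Build the reader options from the snippet above (hypothetical helper)."""
    if not partition_paths:
        raise ValueError("partition_paths must not be empty")
    # Strip stray whitespace and drop duplicates while keeping the original order.
    cleaned = list(dict.fromkeys(p.strip() for p in partition_paths))
    return {
        "basePath": base_path,
        # HUDI expects one comma-separated string here, not a list.
        "hoodie.datasource.read.paths": ",".join(cleaned),
    }


# Usage with an existing SparkSession named `spark` (not created here):
# df = (spark.read.format("org.apache.hudi")
#       .options(**hudi_read_options(base_path, partition_paths))
#       .load(partition_paths))
```

The list still has to be passed twice (once joined, once to load()), which is
exactly the duplication this ticket asks HUDI to remove.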
 
It would be great if HUDI used the partition_paths passed to the Spark read API
(a list[str]) directly.
 
One more thing:
If I do not set "hoodie.datasource.read.paths" and call load(base_path)
instead, I do not get an exception, but in this case the Spark HUDI read scans
the whole table, which can be very time-consuming for a very big table with
lots of partitions.
 
Darvi
 
 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
