[
https://issues.apache.org/jira/browse/HUDI-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Istvan Darvas updated HUDI-4046:
--------------------------------
Description:
Hi Guys!
I would like to control the number of partitions that will be read by Hudi.
base_path: str
partition_paths: List[str] = ["prefix/part1", "prefix/part2", "prefix/part3"]
ingress_pkg_arrived = (spark.read
    .format('org.apache.hudi')
    .option("basePath", base_path)
    .option("hoodie.datasource.read.paths", ",".join(partition_paths))  # comma-separated list
    .load(partition_paths))
This works if I explicitly set "hoodie.datasource.read.paths"; in practice I
have to generate a comma-separated list for that parameter.
If I do not set it, I get a Hudi exception telling me I need to set it.
It would be great if Hudi used the partition_paths - List[str] - passed to the
Spark read API.
One more thing:
I do not get an exception if I do not set "hoodie.datasource.read.paths" and
use load(base_path), but in that case the Hudi read will scan the whole table,
which can be very time-consuming for a very big table with many partitions.
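Until the reader derives the option from the path list itself, the value for
"hoodie.datasource.read.paths" can be generated with a small helper (a sketch
only; read_paths_option is a hypothetical name, not part of Hudi, and the
bucket/table URI below is made up for illustration):

```python
from typing import List

def read_paths_option(base_path: str, partition_paths: List[str]) -> str:
    """Build the comma-separated value expected by
    hoodie.datasource.read.paths, resolving relative partition
    paths against the table's base path."""
    resolved = [
        p if p.startswith(base_path) else f"{base_path.rstrip('/')}/{p}"
        for p in partition_paths
    ]
    return ",".join(resolved)

# Hypothetical table location, for illustration only.
print(read_paths_option("s3://bucket/table", ["prefix/part1", "prefix/part2"]))
# s3://bucket/table/prefix/part1,s3://bucket/table/prefix/part2
```

The same string can then be passed to .option("hoodie.datasource.read.paths", ...)
while the resolved list goes to .load(...), keeping the two in sync.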
Darvi
Connected SLACK thread:
https://apache-hudi.slack.com/archives/C4D716NPQ/p1651667472584579
> spark.read.load API
> -------------------
>
> Key: HUDI-4046
> URL: https://issues.apache.org/jira/browse/HUDI-4046
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.10.1
> Reporter: Istvan Darvas
> Priority: Minor
>
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)