Re: [D] Spark DataSource V2 read and write benchmarks [hudi]

via GitHub Wed, 26 Nov 2025 20:26:47 -0800


GitHub user geserdugarov edited a discussion: Spark DataSource V2 read and 
write benchmarks


Spark DataSource V2 was integrated in 
[RFC-38](https://github.com/apache/hudi/pull/3964). However, there were 
multiple issues with advertising a Hudi table as V2 without actually 
implementing certain APIs, and with using a custom relation rule to fall back 
to the V1 API. As a result, the current implementation of `HoodieCatalog` and 
`Spark3DefaultSource` returns a `V1Table` instead of `HoodieInternalV2Table`, 
in order to [address performance 
regressions](https://github.com/apache/hudi/pull/5737).

Performance issues were not revealed in the initial PR because there was no 
proper benchmarking for such changes. Therefore, to restart this work, it is 
important to first decide how to benchmark the changes. Among other things, 
DataSource V1 allows custom logic, such as the use of Hudi indexes, which is 
not straightforward to implement in DataSource V2. The benchmarking scenarios 
therefore need to cover cases like this.
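
As a starting point for discussion, here is a minimal sketch of a timing 
harness that could compare named scenarios (for example a V1 read against a V2 
read, with and without index usage). It is not an existing Hudi benchmark: 
`time_runs`, `compare`, and the scenario names are hypothetical, and the 
stand-in workloads would have to be replaced by real Spark actions.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(action: Callable[[], object], warmup: int = 2, runs: int = 5) -> List[float]:
    """Call `action` `warmup` times untimed, then return `runs` wall-clock timings in seconds."""
    for _ in range(warmup):
        action()  # warm-up iterations are executed but not measured
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return timings


def compare(scenarios: Dict[str, Callable[[], object]],
            warmup: int = 2, runs: int = 5) -> Dict[str, float]:
    """Return the median wall-clock time in seconds for each named scenario."""
    return {
        name: statistics.median(time_runs(action, warmup, runs))
        for name, action in scenarios.items()
    }


if __name__ == "__main__":
    # Stand-in workloads (assumption). In a real benchmark each callable would
    # trigger a Spark action, e.g. lambda: spark.read.format("hudi").load(path).count(),
    # where `spark` and `path` come from the benchmark setup.
    results = compare({
        "read_v1": lambda: sum(range(100_000)),
        "read_v2": lambda: sum(range(100_000)),
    })
    for name, median_seconds in results.items():
        print(f"{name}: median {median_seconds:.6f} s")
```

Reporting the median over several runs after a warm-up is only one possible 
choice. Whether to also clear caches between runs, or to record metrics such 
as the number of files scanned, is exactly the kind of thing worth agreeing on 
first.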

If anybody has already gone down this path, please share your insights. Any 
suggestions about scenarios that should be considered are also welcome.

GitHub link: https://github.com/apache/hudi/discussions/13955

