[
https://issues.apache.org/jira/browse/HUDI-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexey Kudinkin updated HUDI-4178:
----------------------------------
Story Points: 4 (was: 1)
Summary: Performance regressions in Spark DataSourceV2 Integration
(was: HoodieSpark3Analysis does not pass schema from Spark Catalog)
> Performance regressions in Spark DataSourceV2 Integration
> ---------------------------------------------------------
>
> Key: HUDI-4178
> URL: https://issues.apache.org/jira/browse/HUDI-4178
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.11.0
> Reporter: Alexey Kudinkin
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.11.1
>
>
> There are multiple issues with our current DataSource V2 integration:
> Because we advertise Hudi tables as V2, Spark expects them to implement certain
> APIs which are not implemented at the moment; instead, we're using a custom
> resolution rule (in HoodieSpark3Analysis) to manually fall back to the V1
> APIs. This poses the following problems:
> # It doesn't fully implement Spark's protocol: for example, this rule doesn't
> cache the produced `LogicalPlan`, making Spark re-create Hudi relations from
> scratch (including a full file listing of the table) for every query reading
> the table. However, adding caching in that sequence is not an option, since
> the V2 APIs manage their cache differently; for us to be able to leverage
> that cache we would have to manage its whole lifecycle (adding, flushing)
> # Additionally, the HoodieSpark3Analysis rule does not pass the table's schema
> from the Spark Catalog to Hudi's relations, making them fetch the schema from
> storage (either from a commit's metadata or a data file) every time
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)