[
https://issues.apache.org/jira/browse/HUDI-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexey Kudinkin updated HUDI-3896:
----------------------------------
Description:
After migrating to Hudi's own Relation impls, we unfortunately lost some
of the optimizations that Spark applies exclusively to `HadoopFsRelation`.
While these optimizations could just as well be implemented for any
`FileRelation`, Spark unfortunately predicates them on the use of
`HadoopFsRelation`, which makes them inapplicable to any of Hudi's
relations.
Proper long-term solutions would require fixing this in Spark and could be either of:
# Generalizing such optimizations to any `FileRelation`
# Making `HadoopFsRelation` extensible (making it a non-case class)
One example of this is Spark's `SchemaPruning` optimization rule (HUDI-3891):
Spark 3.2.x is able to effectively reduce the amount of data read via schema
pruning (projecting read data) even for nested structs; however, this
optimization is predicated on the usage of `HadoopFsRelation`:
!Screen Shot 2022-04-16 at 1.46.50 PM.png|width=739,height=143!
was:
After migrating to Hudi's own Relation impls, we unfortunately lost some
of the optimizations that Spark applies exclusively to `HadoopFsRelation`.
While these optimizations could just as well be implemented for any
`FileRelation`, Spark unfortunately predicates them on the use of
`HadoopFsRelation`, which makes them inapplicable to any of Hudi's
relations.
Proper long-term solutions would require fixing this in Spark and could be either of:
# Generalizing such optimizations to any `FileRelation`
# Making `HadoopFsRelation` extensible (making it a non-case class)
> Support Spark optimizations for `HadoopFsRelation`
> --------------------------------------------------
>
> Key: HUDI-3896
> URL: https://issues.apache.org/jira/browse/HUDI-3896
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Alexey Kudinkin
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Fix For: 0.12.0
>
> Attachments: Screen Shot 2022-04-16 at 1.46.50 PM.png
>
>
> After migrating to Hudi's own Relation impls, we unfortunately lost some
> of the optimizations that Spark applies exclusively to `HadoopFsRelation`.
>
> While these optimizations could just as well be implemented for any
> `FileRelation`, Spark unfortunately predicates them on the use of
> `HadoopFsRelation`, which makes them inapplicable to any of Hudi's
> relations.
> Proper long-term solutions would require fixing this in Spark and could be
> either of:
> # Generalizing such optimizations to any `FileRelation`
> # Making `HadoopFsRelation` extensible (making it a non-case class)
>
> One example of this is Spark's `SchemaPruning` optimization rule
> (HUDI-3891): Spark 3.2.x is able to effectively reduce the amount of data
> read via schema pruning (projecting read data) even for nested structs;
> however, this optimization is predicated on the usage of `HadoopFsRelation`:
> !Screen Shot 2022-04-16 at 1.46.50 PM.png|width=739,height=143!
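
The dispatch problem described above can be illustrated with a toy model. The
sketch below is plain Python, not Spark code: the classes only borrow the names
`FileRelation` and `HadoopFsRelation`, while `HudiRelation`,
`prune_hadoop_only`, and `prune_any_file_relation` are hypothetical names
introduced for illustration. It assumes a rule that checks for one specific
relation type before pruning columns, and contrasts it with the first proposed
fix (generalizing the check to any `FileRelation`).

```python
from typing import Iterable, List, Set


class FileRelation:
    """Toy stand-in for a file-backed relation (not Spark's actual trait)."""

    def __init__(self, columns: Iterable[str]) -> None:
        self.columns: List[str] = list(columns)


class HadoopFsRelation(FileRelation):
    """Toy stand-in for the relation type the optimization is predicated on."""


class HudiRelation(FileRelation):
    """Hypothetical custom relation: file-backed, but not a HadoopFsRelation."""


def _project(relation: FileRelation, required: Set[str]) -> List[str]:
    # Keep only the columns the query needs, preserving their original order.
    return [c for c in relation.columns if c in required]


def prune_hadoop_only(relation: FileRelation, required: Set[str]) -> List[str]:
    # Models the situation described above: the rule fires only for
    # HadoopFsRelation, so every other relation is read in full.
    if isinstance(relation, HadoopFsRelation):
        return _project(relation, required)
    return list(relation.columns)


def prune_any_file_relation(relation: FileRelation, required: Set[str]) -> List[str]:
    # Models the first proposed fix: the rule is generalized to any FileRelation.
    if isinstance(relation, FileRelation):
        return _project(relation, required)
    return list(relation.columns)


if __name__ == "__main__":
    cols = ["id", "name", "address.city", "address.zip"]
    needed = {"id", "address.city"}
    print(prune_hadoop_only(HadoopFsRelation(cols), needed))
    print(prune_hadoop_only(HudiRelation(cols), needed))
    print(prune_any_file_relation(HudiRelation(cols), needed))
```

In this model the second proposed fix would correspond to letting
`HudiRelation` subclass `HadoopFsRelation`, so that the unchanged
`prune_hadoop_only` check would also cover it.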