[jira] [Updated] (HUDI-3896) Support Spark optimizations for `HadoopFsRelation`

Alexey Kudinkin (Jira) Sat, 16 Apr 2022 13:48:03 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alexey Kudinkin updated HUDI-3896:
----------------------------------
    Description: 
After migrating to Hudi's own Relation implementations, we unfortunately lost 
some of the optimizations that Spark applies exclusively to `HadoopFsRelation`.

 

While these optimizations could just as well be implemented for any 
`FileRelation`, Spark unfortunately predicates them on the use of 
`HadoopFsRelation`, which makes them inapplicable to any of Hudi's 
relations.

A proper long-term solution would be to fix this in Spark, in either of two ways:
 # Generalizing such optimizations to any `FileRelation`
 # Making `HadoopFsRelation` extensible (making it a non-case class)

 

One example is Spark's `SchemaPruning` optimization rule (HUDI-3891): 
Spark 3.2.x can effectively reduce the amount of data read via schema 
pruning (projecting the data read), even for nested structs. However, this 
optimization is predicated on the use of `HadoopFsRelation`:

!Screen Shot 2022-04-16 at 1.46.50 PM.png|width=739,height=143!
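
To make the problem concrete, here is a minimal, self-contained sketch of how a rule that pattern-matches on a concrete case class skips every other relation, and how matching on a shared trait (the first option above) would not. All names ending in `Sketch` are hypothetical stand-ins invented for illustration; they are not Spark's or Hudi's actual classes, and the real rule's logic is not reproduced here.

```scala
object RelationMatchingSketch {
  // Simplified stand-ins for illustration only; NOT Spark's or Hudi's real classes.
  trait BaseRelationSketch
  trait FileRelationSketch { def inputFiles: Seq[String] }

  // Plays the role of the one concrete relation the optimizer recognizes.
  final case class HadoopFsRelationSketch(inputFiles: Seq[String])
    extends BaseRelationSketch with FileRelationSketch

  // Plays the role of a custom, file-based relation (such as one of Hudi's).
  final class HudiRelationSketch(val inputFiles: Seq[String])
    extends BaseRelationSketch with FileRelationSketch

  // A rule predicated on the concrete case class: it fires for that type only.
  def strictRuleApplies(relation: BaseRelationSketch): Boolean = relation match {
    case HadoopFsRelationSketch(_) => true
    case _                         => false
  }

  // The generalized variant: predicate the rule on the shared trait instead.
  def generalizedRuleApplies(relation: BaseRelationSketch): Boolean = relation match {
    case _: FileRelationSketch => true
    case _                     => false
  }

  def main(args: Array[String]): Unit = {
    val hadoop = HadoopFsRelationSketch(Seq("part-0000.parquet"))
    val hudi   = new HudiRelationSketch(Seq("part-0000.parquet"))
    println(strictRuleApplies(hadoop))    // true
    println(strictRuleApplies(hudi))      // false: the custom relation is skipped
    println(generalizedRuleApplies(hudi)) // true
  }
}
```

Both relations expose the same file listing, yet only the strict rule distinguishes between them, which is the situation described above.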

  was:
After migrating to Hudi's own Relation implementations, we unfortunately lost 
some of the optimizations that Spark applies exclusively to `HadoopFsRelation`.

 

While these optimizations could just as well be implemented for any 
`FileRelation`, Spark unfortunately predicates them on the use of 
`HadoopFsRelation`, which makes them inapplicable to any of Hudi's 
relations.

A proper long-term solution would be to fix this in Spark, in either of two ways:
 # Generalizing such optimizations to any `FileRelation`
 # Making `HadoopFsRelation` extensible (making it a non-case class)


> Support Spark optimizations for `HadoopFsRelation`
> --------------------------------------------------
>
>                 Key: HUDI-3896
>                 URL: https://issues.apache.org/jira/browse/HUDI-3896
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>             Fix For: 0.12.0
>
>         Attachments: Screen Shot 2022-04-16 at 1.46.50 PM.png
>
>
> After migrating to Hudi's own Relation implementations, we unfortunately lost 
> some of the optimizations that Spark applies exclusively to `HadoopFsRelation`.
>  
> While these optimizations could just as well be implemented for any 
> `FileRelation`, Spark unfortunately predicates them on the use of 
> `HadoopFsRelation`, which makes them inapplicable to any of Hudi's 
> relations.
> A proper long-term solution would be to fix this in Spark, in either of two 
> ways:
>  # Generalizing such optimizations to any `FileRelation`
>  # Making `HadoopFsRelation` extensible (making it a non-case class)
>  
> One example is Spark's `SchemaPruning` optimization rule 
> (HUDI-3891): Spark 3.2.x can effectively reduce the amount of data read 
> via schema pruning (projecting the data read), even for nested structs. However, 
> this optimization is predicated on the use of `HadoopFsRelation`:
> !Screen Shot 2022-04-16 at 1.46.50 PM.png|width=739,height=143!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
