[
https://issues.apache.org/jira/browse/HUDI-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexey Kudinkin updated HUDI-3396:
----------------------------------
Fix Version/s: 0.11.0
> Make sure Spark reads only Projected Columns for both MOR/COW
> -------------------------------------------------------------
>
> Key: HUDI-3396
> URL: https://issues.apache.org/jira/browse/HUDI-3396
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Alexey Kudinkin
> Priority: Blocker
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2022-02-08 at 4.58.12 PM.png
>
>
> The Spark Relation implementation for MOR tables seems to have the following issues:
> * `requiredSchemaParquetReader` still uses the full table schema, which means
> that we fetch *all* columns from Parquet (even though the query might
> project just a handful).
> * `fullSchemaParquetReader` always reads the full table schema, (presumably) to
> be able to do merging, which might access arbitrary key fields. This seems
> superfluous, since we could fetch only the fields designated as
> `PRECOMBINE_FIELD_NAME` and `RECORDKEY_FIELD_NAME`. We won't be able
> to do that if either of the following is true:
> ** Virtual keys are used (the key generator will require the whole payload)
> ** A non-trivial merging strategy is used that requires the whole record payload
>
> !Screen Shot 2022-02-08 at 4.58.12 PM.png!
>
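The column-selection rule described in the second bullet can be illustrated with a minimal sketch. This is not Hudi code: the function `columns_to_read`, its parameters, and the sample column names are all hypothetical, and the record key and precombine fields are assumed to be single top-level columns. It only shows the intended logic: read the projected columns plus the two merge fields, and fall back to the full schema in the two cases listed above.

```python
from typing import Iterable, List


def columns_to_read(
    table_columns: List[str],
    projected_columns: Iterable[str],
    record_key_field: str,
    precombine_field: str,
    uses_virtual_keys: bool = False,
    merge_needs_full_record: bool = False,
) -> List[str]:
    """Hypothetical helper: columns a merging reader would have to fetch.

    The result preserves the order of the table schema.
    """
    # Fallback cases from the issue: the whole payload is required.
    if uses_virtual_keys or merge_needs_full_record:
        return list(table_columns)

    # Otherwise only the projection plus the two merge fields are needed.
    required = set(projected_columns) | {record_key_field, precombine_field}

    unknown = required - set(table_columns)
    if unknown:
        raise ValueError(f"columns not in table schema: {sorted(unknown)}")

    return [c for c in table_columns if c in required]


# Illustrative schema and projection (made-up column names).
schema = ["uuid", "ts", "rider", "driver", "fare"]
pruned = columns_to_read(schema, ["fare"], "uuid", "ts")
full = columns_to_read(schema, ["fare"], "uuid", "ts", uses_virtual_keys=True)
```

Here `pruned` contains only `uuid`, `ts`, and `fare`, while `full` contains every column because virtual keys force the whole payload to be read.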
--
This message was sent by Atlassian Jira
(v8.20.1#820001)