[ 
https://issues.apache.org/jira/browse/HUDI-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5236:
----------------------------------
    Description: 
*Problem Statement*

Currently, the performance of the metadata table (MT) is hard to predict due to
a variety of factors, for example whether the MT is compacted. If the table is
NOT compacted, then when loading the "files" partition, for example, we load
all of the delta-log files and materialize them in memory, meaning that all
subsequent requests are served from memory. However, when the table IS
compacted, we only prematerialize the updated records but not the records
sitting in the base file, which requires us to fetch from the base HFile every
time (even though there's block-level caching implemented inside the HFile
reader).
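
The asymmetry described above can be illustrated with a toy model. This is only a sketch: `BaseFile`, `FileSlice`, and `MtLookupModel` are hypothetical names invented for the illustration, not Hudi classes, and the model ignores the HFile block cache entirely.

```java
import java.util.Map;
import java.util.Optional;

// Toy model only: these types are hypothetical and are NOT Hudi APIs.
final class MtLookupModel {

    // Stands in for a base HFile: every get() is counted as one storage read.
    static final class BaseFile {
        private final Map<String, String> records;
        int reads = 0;

        BaseFile(Map<String, String> records) {
            this.records = records;
        }

        Optional<String> get(String key) {
            reads++;
            return Optional.ofNullable(records.get(key));
        }
    }

    // Stands in for one MT file slice: delta-log records are assumed to be
    // already materialized in memory, base-file records are fetched on demand.
    static final class FileSlice {
        private final Map<String, String> logRecords;
        private final BaseFile baseFile;

        FileSlice(Map<String, String> logRecords, BaseFile baseFile) {
            this.logRecords = logRecords;
            this.baseFile = baseFile;
        }

        Optional<String> lookup(String key) {
            String fromLog = logRecords.get(key);
            if (fromLog != null) {
                return Optional.of(fromLog); // served from memory
            }
            return baseFile.get(key);        // goes to the base file every time
        }
    }
}
```

In this model, a slice whose records all live in the delta logs serves repeated lookups without touching the base file, while a freshly compacted slice (empty logs, all records in the base file) pays one base-file read per lookup.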

More generally, `HoodieBackedTableMetadata`, being the primary facade and
interface for MT, currently doesn't have a well-thought-through architecture
or APIs; instead, it serves simply as an aggregation layer for the lower-level
components (LogRecordScanner, FileReader, etc.).

This is problematic, since MT is a core component whose performance has
direct implications for query planning and beyond. As such, it has to have:
 # {*}Predictable performance{*}: how the state of MT affects performance
should be easy to comprehend and reason about (for example, {_}it's expected
that performance could degrade as scale increases or if the table is not
compacted for a long time; however, it's totally unexpected that performance
could become worse after compaction than it was before{_})
 # {*}Clear configuration levers{*}: the behavior and performance of the MT
should be governed by crystal-clear configuration levers, for example whether
records are materialized in-memory or loaded dynamically

 

*Solution*

To address the aforementioned problems, we propose implementing
HoodieBackedTableMetadataV2, providing:
 * {*}Materialization{*}: it should allow MT to be read in either of two ways
 ** _Eagerly:_ the whole MT is loaded into memory before it is accessed
 ** _Lazily:_ MT is queried on an ad-hoc basis, with the results of previous
queries cached for subsequent use
 *  
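
The two materialization modes could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the proposed HoodieBackedTableMetadataV2 API: `MaterializingReader`, `Mode`, `fullScan`, and `pointLookup` are hypothetical names, and the two callbacks stand in for whatever actually reads the MT's log and base files.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch; not the actual HoodieBackedTableMetadataV2 API.
final class MaterializingReader<K, V> {

    enum Mode { EAGER, LAZY }

    // Holds materialized records; in LAZY mode it also remembers misses
    // (as Optional.empty()) so repeated misses do not hit storage again.
    private final Map<K, Optional<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, Optional<V>> pointLookup;
    private final boolean fullyLoaded;

    // fullScan:    assumed to return every record (used only in EAGER mode).
    // pointLookup: assumed to fetch a single record from storage and to
    //              return Optional.empty(), never null, for a missing key.
    MaterializingReader(Mode mode,
                        Supplier<Map<K, V>> fullScan,
                        Function<K, Optional<V>> pointLookup) {
        this.pointLookup = pointLookup;
        if (mode == Mode.EAGER) {
            // Eager: load everything up front, before the first access.
            fullScan.get().forEach((k, v) -> cache.put(k, Optional.of(v)));
            this.fullyLoaded = true;
        } else {
            this.fullyLoaded = false;
        }
    }

    Optional<V> get(K key) {
        if (fullyLoaded) {
            // Everything is in memory; an absent key means "no such record".
            return cache.getOrDefault(key, Optional.empty());
        }
        // Lazy: fetch on first access, then serve later requests from cache.
        return cache.computeIfAbsent(key, pointLookup);
    }
}
```

With such a reader the mode becomes a single explicit lever: an eager reader pays the full load cost once at construction and never calls `pointLookup`, while a lazy reader pays one `pointLookup` per distinct key, whether or not the table has been compacted.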

  was:
*Problem Statement*

Currently, MT performance is hardly predictable due to variety of factors such 
as, for ex, 
whether the MT is compacted: if table is NOT compacted, when loading "files" 
partition for ex, we will load all of the delta-log files materializing them 
in-memory,  meaning that all subsequent requests will be served from memory. 
However, when table IS compacted, we will only prematerialize the updated 
records but not the records sitting in the base file, which would require us to 
go fetch from base HFile every time (even though there's block-level caching 
implemented inside HFile reader).

More generally, `HoodieBackedTableMetadata` being the primary facade and 
interface for MT, currently doesn't have a well thought-through architecture 
and APIs, instead it serves simply as an aggregation layer for the lower-level 
components (LogRecordScanner, FileReader, etc).

This is problematic, since MT is a core component performance of which has 
direct implication on the query planning and beyond. As such, it has to have:
 # {*}Predictable performance{*}: how state of MT affects performance should be 
easy to comprehend and reason about (for ex, {_}it's expected that performance 
could be decreasing, with increase in scale or if the table is not compacted 
for a long time; however it's totally unexpected that performance could become 
worse than it was after compaction{_})
 # {*}Have clear configuration levers{*}: behavior, performance of the MT 
should have crystal clear configuration levers – whether records are 
materialized in-memory or loaded dynamically, 

 

*Solution*

To address aforementioned problems, we propose to implement 
HoodieBackedTableMetadataV2 providing
 * {*}Materialization{*}: it should allow MT to be read in either of 2 ways
 ** _Eagerly:_ when whole MT is loaded in-memory before accessing
 ** _Lazily:_ when MT is queried on an ad-hoc basis, however caching the 
results of the previous queries for subsequent use
 * {*}Configuration{*}: it should be easy to prod 


> Implement HoodieBackedTableMetadata v2
> --------------------------------------
>
>                 Key: HUDI-5236
>                 URL: https://issues.apache.org/jira/browse/HUDI-5236
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>             Fix For: 0.13.1
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
