Alexey Kudinkin created HUDI-5236:
-------------------------------------

             Summary: Implement HoodieBackedTableMetadata v2
                 Key: HUDI-5236
                 URL: https://issues.apache.org/jira/browse/HUDI-5236
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Alexey Kudinkin
             Fix For: 0.13.1


*Problem Statement*

Currently, MT performance is hard to predict because it depends on a variety 
of factors, for example whether the MT is compacted. If the table is NOT 
compacted, then loading the "files" partition, for example, loads all of the 
delta-log files and materializes them in memory, so all subsequent requests 
are served from memory. However, when the table IS compacted, we prematerialize 
only the updated records and not the records sitting in the base file, so we 
have to go fetch from the base HFile every time (even though block-level 
caching is implemented inside the HFile reader).

More generally, `HoodieBackedTableMetadata`, the primary facade and interface 
for the MT, currently lacks a well-thought-out architecture and API; instead 
it serves simply as an aggregation layer for the lower-level components 
(LogRecordScanner, FileReader, etc).

This is problematic, since the MT is a core component whose performance has 
direct implications for query planning and beyond. As such, it has to have:
 # {*}Predictable performance{*}: how the state of the MT affects performance 
should be easy to comprehend and reason about (for example, {_}it's expected 
that performance could degrade as scale increases or if the table is not 
compacted for a long time; however, it's totally unexpected that performance 
could become worse after compaction than it was before{_})
 # {*}Clear configuration levers{*}: the behavior and performance of the MT 
should have crystal clear configuration levers – whether records are 
materialized in-memory or loaded dynamically, 

 

*Solution*

To address the aforementioned problems, we propose to implement 
HoodieBackedTableMetadataV2, providing
 * {*}Materialization{*}: it should allow the MT to be read in either of 2 ways
 ** _Eagerly:_ when the whole MT is loaded in-memory before it is accessed
 ** _Lazily:_ when the MT is queried on an ad-hoc basis, with the results of 
previous queries cached for subsequent use
 * {*}Configuration{*}: it should be easy to prod 
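
The two materialization modes proposed above could look roughly like the 
following. This is only an illustrative sketch: `MaterializationSketch`, 
`RecordSource`, `InMemorySource`, `Mode` and `MetadataView` are hypothetical 
names invented for this example and are not part of Hudi or of the proposed 
HoodieBackedTableMetadataV2 API. Plain string keys and values stand in for 
MT records.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; none of these types exist in Hudi.
class MaterializationSketch {

  // Stand-in for the underlying storage (base file plus delta-log files).
  interface RecordSource {
    Optional<String> lookup(String key);   // point lookup of one record
    Map<String, String> scanAll();         // full scan of all records
  }

  // In-memory source that counts how often the "storage" is touched.
  static final class InMemorySource implements RecordSource {
    final Map<String, String> records;
    final AtomicInteger lookups = new AtomicInteger();
    final AtomicInteger scans = new AtomicInteger();

    InMemorySource(Map<String, String> records) {
      this.records = records;
    }

    public Optional<String> lookup(String key) {
      lookups.incrementAndGet();
      return Optional.ofNullable(records.get(key));
    }

    public Map<String, String> scanAll() {
      scans.incrementAndGet();
      return records;
    }
  }

  enum Mode { EAGER, LAZY }

  static final class MetadataView {
    private final RecordSource source;
    private final Mode mode;
    // Caches hits and misses alike, so repeated misses stay off storage too.
    private final Map<String, Optional<String>> cache = new ConcurrentHashMap<>();

    MetadataView(RecordSource source, Mode mode) {
      this.source = source;
      this.mode = mode;
      if (mode == Mode.EAGER) {
        // Eager: load everything once, before the first access.
        source.scanAll().forEach((k, v) -> cache.put(k, Optional.of(v)));
      }
    }

    Optional<String> get(String key) {
      if (mode == Mode.EAGER) {
        // Everything is already in memory; a missing key is a true miss.
        return cache.getOrDefault(key, Optional.empty());
      }
      // Lazy: fetch on first access, then serve from the cache.
      return cache.computeIfAbsent(key, source::lookup);
    }
  }

  public static void main(String[] args) {
    InMemorySource source =
        new InMemorySource(Map.of("partition=2023/01", "file-1.parquet"));
    MetadataView lazy = new MetadataView(source, Mode.LAZY);
    lazy.get("partition=2023/01");
    lazy.get("partition=2023/01"); // served from the cache
    System.out.println("lookups after two lazy reads: " + source.lookups.get());
  }
}
```

In this sketch the choice of mode is a single explicit constructor argument, 
so the cost model is visible up front: eager mode pays for one full scan and 
then never touches storage, while lazy mode pays for one lookup per distinct 
key.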



