[
https://issues.apache.org/jira/browse/HUDI-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexey Kudinkin updated HUDI-5236:
----------------------------------
Description:
*Problem Statement*
Currently, MT (metadata table) performance is hard to predict because it
depends on a variety of factors, for example whether the MT is compacted.
If the table is NOT compacted, then when loading the "files" partition, for
example, we load all of the delta-log files and materialize them in memory,
so all subsequent requests are served from memory.
However, when the table IS compacted, we prematerialize only the updated
records but not the records sitting in the base file, so we have to fetch
those from the base HFile every time (even though block-level caching is
implemented inside the HFile reader).
More generally, `HoodieBackedTableMetadata`, being the primary facade and
interface for the MT, currently lacks a well thought-through architecture
and API; instead, it serves simply as an aggregation layer over the
lower-level components (LogRecordScanner, FileReader, etc).
This is problematic, since the MT is a core component whose performance has
direct implications for query planning and beyond. As such, it has to have:
# {*}Predictable performance{*}: how the state of the MT affects performance
should be easy to comprehend and reason about (for example, {_}it's expected
that performance may degrade as scale increases or if the table is not
compacted for a long time; however, it's totally unexpected that performance
could become worse after compaction than it was before{_})
# {*}Clear configuration levers{*}: the behavior and performance of the MT
should have crystal-clear configuration levers – whether records are
materialized in-memory or loaded dynamically,
*Solution*
To address the aforementioned problems, we propose implementing
HoodieBackedTableMetadataV2, providing:
* {*}Materialization{*}: it should allow the MT to be read in either of two
ways
** _Eagerly:_ the whole MT is loaded into memory before it is accessed
** _Lazily:_ the MT is queried on an ad-hoc basis, with the results of
previous queries cached for subsequent use
*
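The two materialization modes proposed above could be sketched roughly as
follows. This is only an illustration under stated assumptions, not the
proposed HoodieBackedTableMetadataV2 API: `MetadataStore`, `CountingStore`,
`MaterializationMode` and `MetadataReaderSketch` are hypothetical names, and
the store is a plain in-memory map standing in for the base HFile and
delta-log files.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the underlying storage (base file + delta logs).
interface MetadataStore {
    Map<String, String> scanAll();        // full read of all records
    Optional<String> lookup(String key);  // point lookup of a single record
}

// Hypothetical in-memory store that counts how often it is hit.
final class CountingStore implements MetadataStore {
    private final Map<String, String> records;
    int scans = 0;
    int lookups = 0;

    CountingStore(Map<String, String> records) {
        this.records = new HashMap<>(records);
    }

    @Override
    public Map<String, String> scanAll() {
        scans++;
        return new HashMap<>(records);
    }

    @Override
    public Optional<String> lookup(String key) {
        lookups++;
        return Optional.ofNullable(records.get(key));
    }
}

// Hypothetical configuration lever selecting the materialization mode.
enum MaterializationMode { EAGER, LAZY }

// Not thread-safe; a real implementation would need to handle concurrency.
final class MetadataReaderSketch {
    private final MetadataStore store;
    private final MaterializationMode mode;
    private final Map<String, Optional<String>> cache = new HashMap<>();

    MetadataReaderSketch(MetadataStore store, MaterializationMode mode) {
        this.store = store;
        this.mode = mode;
        if (mode == MaterializationMode.EAGER) {
            // Eager: load everything into memory once, before any access.
            store.scanAll().forEach((k, v) -> cache.put(k, Optional.of(v)));
        }
    }

    Optional<String> get(String key) {
        if (mode == MaterializationMode.EAGER) {
            // Served purely from memory; unknown keys are simply absent.
            return cache.getOrDefault(key, Optional.empty());
        }
        // Lazy: hit the store on first access, then reuse the cached result
        // (including cached "not found" results).
        return cache.computeIfAbsent(key, store::lookup);
    }
}
```

In this sketch the eager reader touches the store exactly once (a full scan),
while the lazy reader touches it once per distinct key, which is one way to
make the cost of each mode easy to reason about.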
was:
*Problem Statement*
Currently, MT performance is hardly predictable due to variety of factors such
as, for ex,
whether the MT is compacted: if table is NOT compacted, when loading "files"
partition for ex, we will load all of the delta-log files materializing them
in-memory, meaning that all subsequent requests will be served from memory.
However, when table IS compacted, we will only prematerialize the updated
records but not the records sitting in the base file, which would require us to
go fetch from base HFile every time (even though there's block-level caching
implemented inside HFile reader).
More generally, `HoodieBackedTableMetadata` being the primary facade and
interface for MT, currently doesn't have a well thought-through architecture
and APIs, instead it serves simply as an aggregation layer for the lower-level
components (LogRecordScanner, FileReader, etc).
This is problematic, since MT is a core component performance of which has
direct implication on the query planning and beyond. As such, it has to have:
# {*}Predictable performance{*}: how state of MT affects performance should be
easy to comprehend and reason about (for ex, {_}it's expected that performance
could be decreasing, with increase in scale or if the table is not compacted
for a long time; however it's totally unexpected that performance could become
worse than it was after compaction{_})
# {*}Have clear configuration levers{*}: behavior, performance of the MT
should have crystal clear configuration levers – whether records are
materialized in-memory or loaded dynamically,
*Solution*
To address aforementioned problems, we propose to implement
HoodieBackedTableMetadataV2 providing
* {*}Materialization{*}: it should allow MT to be read in either of 2 ways
** _Eagerly:_ when whole MT is loaded in-memory before accessing
** _Lazily:_ when MT is queried on an ad-hoc basis, however caching the
results of the previous queries for subsequent use
* {*}Configuration{*}: it should be easy to prod
> Implement HoodieBackedTableMetadata v2
> --------------------------------------
>
> Key: HUDI-5236
> URL: https://issues.apache.org/jira/browse/HUDI-5236
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Alexey Kudinkin
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Fix For: 0.13.1
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)