[ https://issues.apache.org/jira/browse/HUDI-8475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sivabalan narayanan updated HUDI-8475:
--------------------------------------
Description:
As of now, some stats are sent to HoodieBackedTableMetadataWriter from the
write handles, while others are generated on demand by reading the base files
or log files within the HoodieBackedTableMetadataWriter.update() call.
With the new DAG design, we want to populate all stats within the data table
(DT) write handles only and send them back via WriteStatus.
We do not plan to collect the entire WriteStatus in the driver, only the
HoodieWriteStat, so we should be able to manage with WriteStatus holding
all the required stats for all indexes in the metadata table (MDT).

FILES: no additional work required.
col stats: generate all stats within the write handles, including for base
data files.
bloom index: better to ignore it here. We can do an on-demand read from base
files from within the MDTPartitioner.
functional index stats: same as col stats. For bloom, we can defer.
RLI: no additional work required.
secondary index: let's generate all required stats from within all 3 write
handles. For the Append handle, we might have to read the entire file slice,
including the current file being written, and generate the stats from it.
partition stats: yet to be designed.
was:
As of now, some stats are sent to HoodieBackedTableMetadataWriter from the
write handles, while others are generated on demand by reading the base files
or log files within the HoodieBackedTableMetadataWriter.update() call.
With the new DAG design, we want to populate all stats within the data table
(DT) write handles only and send them back via WriteStatus.
We do not plan to collect the entire WriteStatus in the driver, only the
HoodieWriteStat, so we should be able to manage with WriteStatus holding
all the required stats for all indexes in the metadata table (MDT).
> Generate all required stats required for MDT within DT write handles
> --------------------------------------------------------------------
>
> Key: HUDI-8475
> URL: https://issues.apache.org/jira/browse/HUDI-8475
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: metadata, writer-core
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Major
> Fix For: 1.0.0
>
>
> As of now, some stats are sent to HoodieBackedTableMetadataWriter from the
> write handles, while others are generated on demand by reading the base files
> or log files within the HoodieBackedTableMetadataWriter.update() call.
> With the new DAG design, we want to populate all stats within the data table
> (DT) write handles only and send them back via WriteStatus.
> We do not plan to collect the entire WriteStatus in the driver, only the
> HoodieWriteStat, so we should be able to manage with WriteStatus holding
> all the required stats for all indexes in the metadata table (MDT).
>
> FILES: no additional work required.
> col stats: generate all stats within the write handles, including for base
> data files.
> bloom index: better to ignore it here. We can do an on-demand read from base
> files from within the MDTPartitioner.
> functional index stats: same as col stats. For bloom, we can defer.
> RLI: no additional work required.
> secondary index: let's generate all required stats from within all 3 write
> handles. For the Append handle, we might have to read the entire file slice,
> including the current file being written, and generate the stats from it.
> partition stats: yet to be designed.
>
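To make the col stats item concrete, here is a minimal sketch of a write
handle accumulating per-column stats while it writes and returning them in its
status object, so nothing has to be re-read from the file afterwards. It is
illustrative Python, not Hudi code: ColumnRangeStats, WriteStatusSketch and
WriteHandleSketch are hypothetical names, and the tracked fields (min, max,
null count, value count) are an assumption about what col stats would need.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ColumnRangeStats:
    """Running stats for one column (hypothetical type, not a Hudi class)."""
    min_value: Optional[Any] = None
    max_value: Optional[Any] = None
    null_count: int = 0
    value_count: int = 0

    def update(self, value: Any) -> None:
        # Count every value seen; nulls do not affect min/max.
        self.value_count += 1
        if value is None:
            self.null_count += 1
            return
        if self.min_value is None or value < self.min_value:
            self.min_value = value
        if self.max_value is None or value > self.max_value:
            self.max_value = value


@dataclass
class WriteStatusSketch:
    """Stand-in for the status a write handle sends back (hypothetical)."""
    file_id: str
    num_records: int = 0
    column_stats: Dict[str, ColumnRangeStats] = field(default_factory=dict)


class WriteHandleSketch:
    """Updates column stats inline with each write, with no later re-read."""

    def __init__(self, file_id: str, indexed_columns: List[str]) -> None:
        self.status = WriteStatusSketch(file_id=file_id)
        for col in indexed_columns:
            self.status.column_stats[col] = ColumnRangeStats()

    def write(self, record: Dict[str, Any]) -> None:
        # The actual write to the base or log file would happen here.
        self.status.num_records += 1
        for col, stats in self.status.column_stats.items():
            stats.update(record.get(col))

    def close(self) -> WriteStatusSketch:
        # The stats travel back with the status instead of being recomputed.
        return self.status


handle = WriteHandleSketch("file-0001", ["price", "city"])
records = [
    {"price": 12, "city": "Oslo"},
    {"price": 7, "city": None},
    {"price": 30, "city": "Bern"},
]
for rec in records:
    handle.write(rec)
status = handle.close()
```

The same shape would not cover the Append handle case for the secondary index
as described above, since that may need the whole file slice rather than only
the records passing through the handle.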