[
https://issues.apache.org/jira/browse/HUDI-3637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517173#comment-17517173
]
Ethan Guo edited comment on HUDI-3637 at 4/5/22 2:58 AM:
---------------------------------------------------------
I caught one scenario where the file listing from the FS differs from the
file listing from the metadata table (MDT). It shows up when validating the
latest file slices returned by the
AbstractTableFileSystemView::getLatestFileSlices API.
This is a MOR table with async cleaning and compaction. The problem occurs
when there is an inflight deltacommit in the data table timeline. Below is the
simplified data table timeline; the job failed after DC2 had written its
corresponding updates to the MDT but before DC2 completed in the data table.
DC1 (completed), DC2 (inflight)
In one latest file slice, the base file comes from DC1 and the log file
appended to it comes from DC2, which has not been committed. At this point,
AbstractTableFileSystemView::getLatestFileSlices returns different results
depending on whether it is backed by the FS or by the metadata table. With
the FS, the API includes the inflight log file; with the metadata table, it
does not.
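To make the mismatch concrete, here is a minimal sketch in Python of the
DC1/DC2 scenario above. It is a toy model, not Hudi code: the LogFile and
FileSlice classes, the written_by field, the file names, and the
committed_only helper are all hypothetical and only illustrate why a raw
comparison fails while a comparison restricted to completed instants could
pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogFile:
    name: str
    written_by: str  # hypothetical: instant that appended this log file


@dataclass(frozen=True)
class FileSlice:
    base_instant: str
    base_file: str
    log_files: tuple


# Simplified data table timeline from the comment: DC1 completed, DC2 inflight.
timeline = {"DC1": "COMPLETED", "DC2": "INFLIGHT"}


def committed_only(file_slice, timeline):
    """Drop log files written by instants that are not completed."""
    kept = tuple(
        lf for lf in file_slice.log_files
        if timeline.get(lf.written_by) == "COMPLETED"
    )
    return FileSlice(file_slice.base_instant, file_slice.base_file, kept)


# FS-backed view: sees the log file that the inflight DC2 already wrote.
fs_slice = FileSlice("DC1", "base_DC1.parquet", (LogFile(".log.1", "DC2"),))
# Metadata-table-backed view: does not expose the uncommitted log file.
mdt_slice = FileSlice("DC1", "base_DC1.parquet", ())

raw_match = fs_slice == mdt_slice
filtered_match = (
    committed_only(fs_slice, timeline) == committed_only(mdt_slice, timeline)
)
```

In this model raw_match is False and filtered_match is True, which mirrors
the observation that the two views disagree only on the uncommitted log file.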
It turns out that, by design, AbstractTableFileSystemView::getLatestFileSlices
can return file slices that are inflight. Compaction and clustering
scheduling depend on this API. To make sure inflight slices don't get into
the plan, the timeline cannot have any inflight commits when compaction or
clustering is scheduled.
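The scheduling constraint can be sketched as a simple guard. This is an
illustration in Python under assumed names (inflight_instants,
schedule_compaction, and the dictionary layout of a slice are hypothetical);
it is not how Hudi's scheduler is implemented, only the rule stated above:
no plan is produced while any instant on the timeline is inflight.

```python
def inflight_instants(timeline):
    """Return the instants that are not completed, in timeline order."""
    return [instant for instant, state in timeline if state != "COMPLETED"]


def schedule_compaction(timeline, latest_slices):
    """Build a plan only when no instant is inflight; otherwise return None."""
    if inflight_instants(timeline):
        # An inflight instant may have written uncommitted log files that the
        # latest file slices would include, so skip scheduling for now.
        return None
    # Toy plan: compact every file group that has at least one log file.
    return [s["file_id"] for s in latest_slices if s["log_files"]]


slices = [
    {"file_id": "fg-1", "log_files": [".log.1"]},
    {"file_id": "fg-2", "log_files": []},
]

blocked = schedule_compaction(
    [("DC1", "COMPLETED"), ("DC2", "INFLIGHT")], slices
)
allowed = schedule_compaction(
    [("DC1", "COMPLETED"), ("DC2", "COMPLETED")], slices
)
```

With DC2 inflight the guard returns None; once DC2 completes, the toy plan
contains only the file group that has log files.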
> Check file listing from FS vs metadata table when compaction in pending and
> inflight
> ------------------------------------------------------------------------------------
>
> Key: HUDI-3637
> URL: https://issues.apache.org/jira/browse/HUDI-3637
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.11.0
>
>
> HoodieMetadataTableValidator validation of the latest base files and file
> slices fails with the output below: compared to the FS view, log files are
> missing from the MT view. The validation failure may be due to the inflight
> compaction. Need to investigate whether this affects the file listing for
> write operations. The validation passes after some more instants, so the
> correctness of the MT is guaranteed, but the file listing view may have a
> bug.
> {code:java}
> file slices from metadata: [FileSlice
> {fileGroupId=HoodieFileGroupId{partitionPath='2022/1/28',
> fileId='769bf7ac-d6d0-452c-bf54-bbe7e8381766-0'},
> baseCommitTime=20220314001058266,
> baseFile='HoodieBaseFile{fullPath=file:/Users/ethan/Work/scripts/mt_rollout_testing/deploy_c_multi_writer/c2_mor_010nomt_011mt/test_table/2022/1/28/769bf7ac-d6d0-452c-bf54-bbe7e8381766-0_2-47-485_20220314001058266.parquet,
> fileLen=106839698, BootstrapBaseFile=null}', logFiles='[]'}]
> file slices from file system and base files: [FileSlice
> {fileGroupId=HoodieFileGroupId{partitionPath='2022/1/28',
> fileId='769bf7ac-d6d0-452c-bf54-bbe7e8381766-0'},
> baseCommitTime=20220314001058266,
> baseFile='HoodieBaseFile{fullPath=file:/Users/ethan/Work/scripts/mt_rollout_testing/deploy_c_multi_writer/c2_mor_010nomt_011mt/test_table/2022/1/28/769bf7ac-d6d0-452c-bf54-bbe7e8381766-0_2-47-485_20220314001058266.parquet,
> fileLen=106839698, BootstrapBaseFile=null}',
> logFiles='[HoodieLogFile{pathStr='file:/Users/ethan/Work/scripts/mt_rollout_testing/deploy_c_multi_writer/c2_mor_010nomt_011mt/test_table/2022/1/28/.769bf7ac-d6d0-452c-bf54-bbe7e8381766-0_20220314001058266.log.1_2-111-954',
> fileLen=51607682}]'}]
> 22/03/14 00:33:03 ERROR HoodieMetadataTableValidator: Metadata table
> validation failed for 2022/1/28 due to HoodieValidationException {code}
> Compaction:
> {code:java}
> Partition Path │ FileId │ Base-Instant │ Data File Path │ Total Delta Files │ getMetrics
> ══════════════════════════════════════════════════════════════════════════════
> 2022/1/28 │ 769bf7ac-d6d0-452c-bf54-bbe7e8381766-0 │ 20220314001058266 │ 769bf7ac-d6d0-452c-bf54-bbe7e8381766-0_2-47-485_20220314001058266.parquet │ 1 │ {TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=151.0, TOTAL_LOG_FILES_SIZE=5.1607682E7, TOTAL_IO_WRITE_MB=101.0, TOTAL_IO_MB=252.0}
> {code}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)