[ 
https://issues.apache.org/jira/browse/HUDI-8282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886830#comment-17886830
 ] 

sivabalan narayanan edited comment on HUDI-8282 at 10/4/24 3:50 AM:
--------------------------------------------------------------------

Let's first understand what we did in 0.x.

We are looking to fix two problems by adding a per log file marker.
a.i. MOR data table rollbacks failed to sync the original log files from the
failed commit to MDT.
a.ii. Along these lines, if a rollback instant is retried multiple times, any
log files added by failed rollback attempts should also be synced to MDT.
b. If there are spurious log files created even w/ successful commits, we need
to ensure these spurious log files are also synced to MDT.

So, to fix all of the above, we are adding a per log file marker. Any log file
added or appended to will create a marker. We don't really need to distinguish
between create and append, so we will go w/ the APPEND IOType for markers.
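
A minimal sketch of "one marker per log file, always APPEND" is below. It is
plain Python, not Hudi code: the helper names (marker_name, markers_for) and the
sample log file names are hypothetical, and the ".marker.<IOType>" suffix is an
assumption about how marker files are named.

```python
from enum import Enum


class IOType(Enum):
    # Simplified stand-in for the marker IO types discussed above.
    CREATE = "CREATE"
    MERGE = "MERGE"
    APPEND = "APPEND"


def marker_name(partition: str, log_file: str, io_type: IOType = IOType.APPEND) -> str:
    """Build a marker path (relative to the instant's marker dir) for one log file.

    Assumption: markers are named "<file name>.marker.<IOType>".
    """
    return f"{partition}/{log_file}.marker.{io_type.name}"


def markers_for(log_files):
    """One marker per distinct (partition, log file) pair, whether created or appended to."""
    return sorted({marker_name(partition, name) for partition, name in log_files})
```

Creating and then appending to the same log file yields the same marker, which
is why no separate CREATE marker is needed for log files in this model.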

Fix for (a.i): Any log file added will emit a marker. If the commit of
interest fails, Hudi will trigger a rollback. During rollback planning, we use
the markers to identify the original log files added by the failed commit and
track them as part of the rollback plan. These also get tracked in
HoodieRollbackMetadata (apache/hudi#9653, where we upgraded the schema).
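
To make the planning step concrete, here is a rough sketch of recovering the
failed commit's log files from its markers and grouping them by partition. The
function name, the marker suffix, and the sample paths are assumptions for
illustration; this is not the actual rollback planner.

```python
from collections import defaultdict

# Assumed marker suffix for log files (see the APPEND IOType discussion above).
MARKER_SUFFIX = ".marker.APPEND"


def log_files_from_markers(marker_paths):
    """Map partition -> sorted log file names, derived only from APPEND markers."""
    plan = defaultdict(list)
    for marker in marker_paths:
        if not marker.endswith(MARKER_SUFFIX):
            continue  # base file markers (CREATE / MERGE) are handled elsewhere
        data_path = marker[: -len(MARKER_SUFFIX)]
        partition, _, file_name = data_path.rpartition("/")
        plan[partition].append(file_name)
    return {partition: sorted(names) for partition, names in plan.items()}
```

The resulting map is what a rollback plan would carry per partition in this
simplified model.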

Fix for (a.ii): Whenever a rollback is triggered, marker based rollback will be
able to track all log files added as part of the failed commit. During rollback
execution, Hudi adds a rollback command block. With this patch, we also emit
markers for such log files (rollback command blocks). During rollback
execution, apart from adding the log files added by the failed commit to
HoodieRollbackMetadata, we also add the log files which could have been added
by previous attempts of rollback for the same instant.
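
A small sketch of what the rollback metadata ends up carrying in 0.x: log files
from the failed commit plus log files left behind by earlier rollback attempts.
The dictionary keys and function names here are hypothetical and do not claim to
match the HoodieRollbackMetadata schema.

```python
def build_rollback_metadata(failed_commit_logs, prior_rollback_logs):
    """Both inputs map partition -> iterable of log file names (found via markers)."""
    partitions = set(failed_commit_logs) | set(prior_rollback_logs)
    return {
        partition: {
            # Hypothetical field names, chosen for readability only.
            "log_files_from_failed_commit": sorted(set(failed_commit_logs.get(partition, []))),
            "rollback_log_files": sorted(set(prior_rollback_logs.get(partition, []))),
        }
        for partition in sorted(partitions)
    }


def files_to_sync_to_mdt(metadata):
    """Every log file the rollback must report to MDT, as 'partition/name' paths."""
    files = set()
    for partition, entry in metadata.items():
        for name in entry["log_files_from_failed_commit"] + entry["rollback_log_files"]:
            files.add(f"{partition}/{name}")
    return sorted(files)
```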

Fix for (b): During the marker based reconciliation step, we take the log
files from markers and compare them against HoodieCommitMetadata's
HoodieWriteStat. If any additional files are tracked using markers (which could
happen due to Spark retries), we add a new HoodieWriteStat and update
HoodieCommitMetadata, so that when this syncs to MDT, we don't miss tracking
these spurious log files. We will use apache/hudi#9545 to skip such spurious
log files on the reader side. So, on the writer side, we just want to ensure we
don't miss tracking any log file created by Hudi.
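
The 0.x reconciliation described above can be sketched as a set difference
followed by adding entries, never deleting. Write stats are modelled as plain
dicts here; the "num_writes" field and the function name are hypothetical.

```python
def reconcile_log_files_0x(write_stats, marker_log_files):
    """Return write stats extended with any log file seen in markers but not tracked.

    write_stats: list of dicts with 'partition' and 'path' keys.
    marker_log_files: iterable of (partition, path) pairs derived from markers.
    """
    tracked = {(stat["partition"], stat["path"]) for stat in write_stats}
    extra = sorted(set(marker_log_files) - tracked)
    # Spurious files are tracked, not deleted: a concurrent reader may be using them.
    added = [{"partition": p, "path": f, "num_writes": 0} for p, f in extra]
    return write_stats + added
```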

Note: The reconciliation for log files is kind of the opposite of what happens
w/ data files. W/ data files, any extraneous files are deleted. But we can't
afford to delete extraneous log files, since there could be a concurrent
reader trying to read the file slice of interest. Eventually during execution,
the reader might parse the log block header and skip the block if it belongs to
a partially failed commit or an inflight commit. Anyways, in short, we can't
afford to delete any log files at any point in time, except in the cleaner. So,
for any extraneous log files detected, we fix the HoodieCommitMetadata to track
these additional log files as well.

 

 

1.x:

With 1.x, there are three major behavior differences. Log files are deleted
during rollbacks. Because of that, rollback will not add any rollback command
blocks. And appends are disabled, which means there will be only one log block
per log file.

Having said this, let's see what we might want to do in 1.x:

a. Identify log files to be deleted for failed writes.
b. If there are spurious log files created even w/ successful commits, we need
to ensure these spurious log files are also synced to MDT if need be, or
deleted diligently.

 

So, to fix both, we are adding a per log file marker. Any log file added or
appended to will create a marker. We don't really need to distinguish between
create and append, so we will go w/ the APPEND IOType for markers.

Fix for (a): Any log file added will emit a marker. If the commit of interest
fails, Hudi will trigger a rollback. During rollback planning, we use the
markers to identify the original log files added by the failed commit and track
them as part of the rollback plan. The log files are then deleted during
rollback.

So, even if the original delta commit was not synced to MDT, whenever the
rollback syncs to MDT, it might trigger a delete of a log file that is
non-existent (from MDT's viewpoint), and we should be good.

Even if rollback goes through multiple attempts in the data table, we do not
have the issue we have in 0.x, since rollback in 1.0 will not add a new log
file in the data table, but will only delete files belonging to partially
failed writes.
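
A toy model of why this is safe: if MDT's file listing is treated as a set,
applying a delete for a file MDT never knew about is a no-op, and re-applying
the same rollback changes nothing. The set model and the function name are
assumptions made purely for illustration.

```python
def apply_rollback_to_mdt(mdt_files, deleted_files):
    """Return MDT's file listing after a rollback reports deleted_files.

    Deleting a file that is absent from the listing is harmless in this model.
    """
    return set(mdt_files) - set(deleted_files)
```

For example, rolling back a delta commit whose log file was never synced leaves
the listing unchanged, and a retried rollback yields the same listing as the
first attempt.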

Fix for (b): During the marker based reconciliation step, we take the log
files from markers and compare them against HoodieCommitMetadata's
HoodieWriteStat. If any additional files are tracked using markers (which could
happen due to Spark retries), we just delete them, similar to how we delete
spurious base files.
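
In contrast to 0.x, the 1.x reconciliation can be sketched as "anything in
markers but not in the commit metadata gets deleted". The delete callback and
function name are hypothetical; a real implementation would go through the
table's storage layer.

```python
def reconcile_log_files_1x(committed_log_files, marker_log_files, delete):
    """Delete log files that have a marker but are absent from the commit metadata.

    committed_log_files / marker_log_files: iterables of 'partition/name' paths.
    delete: callable invoked once per spurious path (assumed to remove the file).
    Returns the sorted list of paths that were deleted.
    """
    spurious = sorted(set(marker_log_files) - set(committed_log_files))
    for path in spurious:
        delete(path)
    return spurious
```

Passing `list.append` of a scratch list as `delete` is a convenient way to dry
run the reconciliation and inspect what would be removed.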

 

So, in summary, all we need is to ensure each log file has an individual marker
generated. During marker based reconciliation, any spurious log files will
then also be deleted.

 

 

 



> Create per log file marker for 1.x 
> -----------------------------------
>
>                 Key: HUDI-8282
>                 URL: https://issues.apache.org/jira/browse/HUDI-8282
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: writer-core
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>             Fix For: 1.0.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> This ticket is meant for 1.x. We added per log file markers to the
> 0.x branch in https://issues.apache.org/jira/browse/HUDI-1517
>  
> Commit of interest: 
> [https://github.com/apache/hudi/commit/c2c7e0538f8cf3031781ebdd776d1c03bfec3bb3]
>  
>  
> We might need to add similar support to 1.x, hence the tracking ticket.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
