vinothchandar commented on code in PR #18259:
URL: https://github.com/apache/hudi/pull/18259#discussion_r2862112031


##########
rfc/rfc-100/rfc-100.md:
##########
@@ -201,6 +201,17 @@ To identify these references, we have three options:
 
 **Option 1 will be implemented in milestone 1.**
 
+**Implementation Details**:
+
+The main assumption for out-of-line, managed blobs is that each blob will be 
used exactly once. This implies that a blob will not be referenced by multiple 
rows in the dataset. Similarly, once a row is updated to point to a new blob, 
the old blob will no longer be referenced.
+
+The cleaner plan will remain the same, but during cleaner execution we will 
search for blobs that are no longer referenced: we iterate through the files 
being removed and create a dataset of the managed blob references contained in 
those files. Then we create a dataset of the remaining blob references and use 
the `HoodieEngineContext` to left-join it with the removed blob references to 
identify the unreferenced blobs. These unreferenced blobs will then be deleted 
from storage.
+The blob deletion must therefore happen before removing the files marked for 
deletion. If the cleaner crashes during execution, we should be able to re-run 
the plan in an idempotent manner. To account for this, we can skip any files 
that are already deleted when searching for unreferenced blobs.
+
+If global updates are enabled for the table, we will need to search through 
all the file slices since the data can move between partitions. If global 
updates are not enabled, we can limit the search with the following 
optimizations:
+- For files that are being removed but have a newer file slice for the file 
group, we can limit the search to files within the same file group.
+- For files that are being removed and do not have a newer file slice for the 
file group (this will occur during replace commits & clustering), we will need 
to inspect all the retained files in the partition that were created after the 
creation of the removed file slice since the data can move between file groups 
within the same partition.
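
The anti-join search described in this hunk could be sketched roughly as 
follows, using plain Java over in-memory sets rather than the distributed 
`HoodieEngineContext` datasets; the class and method names are illustrative 
only and are not part of the Hudi codebase:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of the cleaner's unreferenced-blob search: collect the
// blob references contained in the files being removed, collect the references
// still present in the retained files in scope (same file group, partition, or
// the whole table, per the optimizations above), and anti-join the two sets.
public class UnreferencedBlobSearch {

    // Equivalent of a left anti join: keep only the removed-file references
    // that no retained file still points to. Under the "each blob is used by
    // exactly one row" assumption, those blobs are safe to delete from storage.
    static Set<String> findUnreferencedBlobs(Set<String> refsInRemovedFiles,
                                             Set<String> refsInRetainedFiles) {
        return refsInRemovedFiles.stream()
                .filter(ref -> !refsInRetainedFiles.contains(ref))
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Set<String> removed = new HashSet<>(Arrays.asList("blob-a", "blob-b", "blob-c"));
        Set<String> retained = new HashSet<>(Arrays.asList("blob-b"));
        // blob-b is still referenced by a retained file, so only
        // blob-a and blob-c are candidates for deletion.
        System.out.println(findUnreferencedBlobs(removed, retained));
    }
}
```

In the real implementation the two sides would be distributed datasets joined 
via the engine context, but the set difference above is the core invariant: a 
blob is deleted only if every file referencing it is in the removal list.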

Review Comment:
   Same as above: the problematic race, except R is a concurrent replacecommit 
that the cleaning execution does not see.



##########
rfc/rfc-100/rfc-100.md:
##########
@@ -201,6 +201,17 @@ To identify these references, we have three options:
 
 **Option 1 will be implemented in milestone 1.**
 
+**Implementation Details**:
+
+The main assumption for out-of-line, managed blobs is that each blob will be 
used exactly once. This implies that a blob will not be referenced by multiple 
rows in the dataset. Similarly, once a row is updated to point to a new blob, 
the old blob will no longer be referenced.
+
+The cleaner plan will remain the same, but during cleaner execution we will 
search for blobs that are no longer referenced: we iterate through the files 
being removed and create a dataset of the managed blob references contained in 
those files. Then we create a dataset of the remaining blob references and use 
the `HoodieEngineContext` to left-join it with the removed blob references to 
identify the unreferenced blobs. These unreferenced blobs will then be deleted 
from storage.

Review Comment:
   yeah, agree. if it's all correct, then we should limit the view, so it's 
deterministic and isolated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to