phet commented on PR #3575:
URL: https://github.com/apache/gobblin/pull/3575#issuecomment-1271153373

   > it seems that as long as there is one new snapshot generated on the source, 
we will go through all the data files available on the source to do the copy, 
even if there is only one new file added.
   
   actually we'll always go through the complete metadata on source to list 
every file reachable from at least one snapshot.  it's true we do that even 
when there are no 'new', unreplicated snapshots.  actual copying, however, only 
happens for files that are not present on the destination.  further, we need 
not examine every file to determine whether it exists on dest.  rather, thanks 
to the immutability of iceberg files, we may short-circuit evaluation of an 
entire subtree of the iceberg metadata when the root (e.g. manifest-list or 
manifest) is found to already exist at dest.  for details, I've added the 
comment `// ALGO:` in `IcebergDataset.getFilePathsToFileStatus()`
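   the short-circuit idea above can be sketched roughly as follows (this is an 
illustrative model, not the actual `getFilePathsToFileStatus()` code: the 
`MetaNode` type and plain `Set` stand in for iceberg's metadata tree and the 
dest filesystem check):

```java
import java.util.*;

public class ShortCircuitSketch {
    // hypothetical node in the iceberg metadata tree: a file path plus the
    // files it references (manifest-list -> manifests -> data files)
    static final class MetaNode {
        final String path;
        final List<MetaNode> children;
        MetaNode(String path, List<MetaNode> children) {
            this.path = path;
            this.children = children;
        }
    }

    // collect only paths missing at dest, pruning any subtree whose root is
    // already there -- valid because iceberg files are immutable: an existing
    // manifest-list or manifest implies its entire subtree was already copied
    static List<String> filePathsToCopy(MetaNode root, Set<String> existsAtDest) {
        List<String> toCopy = new ArrayList<>();
        Deque<MetaNode> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            MetaNode node = stack.pop();
            if (existsAtDest.contains(node.path)) {
                continue; // short-circuit: skip the whole subtree unexamined
            }
            toCopy.add(node.path);
            node.children.forEach(stack::push);
        }
        return toCopy;
    }

    public static void main(String[] args) {
        MetaNode data1 = new MetaNode("data-1.parquet", List.of());
        MetaNode data2 = new MetaNode("data-2.parquet", List.of());
        MetaNode manifestA = new MetaNode("manifest-a.avro", List.of(data1));
        MetaNode manifestB = new MetaNode("manifest-b.avro", List.of(data2));
        MetaNode manifestList = new MetaNode("snap-1.avro", List.of(manifestA, manifestB));
        // manifest-a already exists at dest, so data-1 is never even examined
        Set<String> atDest = Set.of("manifest-a.avro");
        System.out.println(filePathsToCopy(manifestList, atDest));
    }
}
```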
   
   > How do we plan to handle file deletion on the source? i.e. the expire 
snapshots operation?
   
   good question!  the answer is: distcp is not responsible.  instead, we 
expect reachability analysis and orphan file deletion to happen elsewhere.  a 
good candidate would be the destination catalog we'll eventually register the 
copied 'metadata.json' file with.  e.g. that catalog would hold the metadata 
version prior to the registration and could easily determine which snapshots 
'expire' from the act of replacing the older metadata file with the newer one 
(which replication has copied from the source)
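   to make the catalog-side determination concrete, here's a minimal sketch, 
assuming the catalog can extract snapshot IDs from both metadata versions (the 
class and method names are hypothetical, not part of gobblin or iceberg):

```java
import java.util.*;

public class ExpiredSnapshotsSketch {
    // given the snapshot IDs from the metadata.json the catalog held before
    // registration, and those from the newly registered one (copied from
    // source), any snapshot present before but absent now has 'expired' --
    // its exclusively-reachable files become orphan-cleanup candidates
    static Set<Long> expiredSnapshotIds(Set<Long> priorSnapshotIds,
                                        Set<Long> newSnapshotIds) {
        Set<Long> expired = new HashSet<>(priorSnapshotIds);
        expired.removeAll(newSnapshotIds);
        return expired;
    }

    public static void main(String[] args) {
        Set<Long> prior = Set.of(100L, 200L, 300L);
        Set<Long> current = Set.of(200L, 300L, 400L); // 400 new; 100 gone
        System.out.println(expiredSnapshotIds(prior, current)); // prints [100]
    }
}
```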


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
