neuyilan commented on PR #4844:
URL: https://github.com/apache/paimon/pull/4844#issuecomment-2575686738

   
   
![image](https://github.com/user-attachments/assets/2f4943d7-a0df-479e-a92f-a6e771c71460)
   
   Hi Jingsong, based on the original design [1] and the discussion above, 
I plan to refactor the clone action into the following Flink batch job. 
   1. The first stage picks the tables that need to be cloned. If the 
database parameter is not passed, all tables of all databases are 
cloned. If the table parameter is not passed, all tables of the database 
are cloned. (Unchanged from the original design.)
   2. The second stage picks the related files (Snapshot, Schema, ManifestList, 
Manifest, Datafile, ChangeLog, IndexFile) of the snapshot in the source table. 
(Unchanged from the original design.)
   3. The third stage only copies the schema files to the target path. The 
schema files comprise Snapshot, Schema, ManifestList, and IndexFile. 
   4. The fourth stage copies or rewrites the manifest files in parallel. 
If an external path is involved, the manifest is rewritten; otherwise, it is 
copied.
   5. Shuffle the data files by file name (data files comprise Datafile and 
ChangeLog).
   6. The fifth stage copies the data files in parallel. 
   7. Shuffle by the target table name to the next stage.
   8. The sixth stage recreates the snapshot hint file. (Unchanged from the 
original design.)
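
To make the stage decisions above concrete, here is a minimal plain-Java sketch of three of them: table selection (stage 1), the copy-or-rewrite choice for manifests (stage 4), and routing data files by file name (step 5). Every name here (`ClonePlanSketch`, `pickTables`, `manifestAction`, `subtaskFor`) is hypothetical and is not part of the Paimon or Flink API. The modulo routing is only a stand-in for whatever partitioning the real Flink job would use, and ignoring the table parameter when no database is given is an assumption read from item 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; none of these names exist in Paimon or Flink.
class ClonePlanSketch {

    enum ManifestAction { COPY, REWRITE }

    // Stage 1: a null database selects every table of every database;
    // a null table selects every table of the given database.
    // Assumption: the table filter only applies when a database is given.
    public static List<String> pickTables(
            Map<String, List<String>> catalog, String database, String table) {
        List<String> picked = new ArrayList<>();
        for (Map.Entry<String, List<String>> db : catalog.entrySet()) {
            if (database != null && !database.equals(db.getKey())) {
                continue;
            }
            for (String t : db.getValue()) {
                if (database != null && table != null && !table.equals(t)) {
                    continue;
                }
                picked.add(db.getKey() + "." + t);
            }
        }
        return picked;
    }

    // Stage 4: rewrite the manifest when an external path is involved,
    // otherwise copy it unchanged.
    public static ManifestAction manifestAction(boolean usesExternalPath) {
        return usesExternalPath ? ManifestAction.REWRITE : ManifestAction.COPY;
    }

    // Step 5: route a data file to a copy subtask by its file name.
    // Simplified stand-in: the same name always maps to the same subtask.
    public static int subtaskFor(String fileName, int parallelism) {
        return Math.floorMod(fileName.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        // TreeMap keeps database iteration order deterministic.
        Map<String, List<String>> catalog = new TreeMap<>();
        catalog.put("db1", List.of("orders", "users"));
        catalog.put("db2", List.of("events"));

        System.out.println(pickTables(catalog, null, null));
        System.out.println(pickTables(catalog, "db1", null));
        System.out.println(pickTables(catalog, "db1", "users"));
        System.out.println(manifestAction(true));
        System.out.println(subtaskFor("data-0.orc", 4));
    }
}
```

Keying the shuffle on the file name means each data file is handled by exactly one copy subtask, which is the property steps 5 and 6 appear to rely on.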
   
   Please help confirm whether this refactoring is appropriate. Thanks.
   
   [1] 
https://cwiki.apache.org/confluence/display/PAIMON/PIP-18%3A+Introduce+clone+Action+and+Procedure


