neuyilan commented on PR #4844: URL: https://github.com/apache/paimon/pull/4844#issuecomment-2575686738
Hi Jingsong, based on the original design [1] and the discussion above, I plan to refactor this into the following Flink batch job:

1. The first stage picks the tables that need to be cloned. If the database parameter is not passed, all tables of all databases are cloned. If the table parameter is not passed, all tables of the database are cloned. (Unchanged from the original design.)
2. The second stage picks the related files (Snapshot, Schema, ManifestList, Manifest, Datafile, ChangeLog, IndexFile) of the snapshot in the source table. (Unchanged from the original design.)
3. The third stage only copies the schema files to the target path. The schema files are: Snapshot, Schema, ManifestList, and IndexFile.
4. The fourth stage copies or rewrites the manifest files in distributed parallelism. If an external path is involved, the manifest file is rewritten; otherwise, it is copied.
5. Shuffle the data files by file name. (Data files are Datafile and ChangeLog.)
6. The fifth stage copies the data files in distributed parallelism.
7. Shuffle by the target table name to the next stage.
8. The sixth stage recreates the snapshot hint file. (Unchanged from the original design.)

Please help confirm whether this refactoring is appropriate. Thanks.

[1] https://cwiki.apache.org/confluence/display/PAIMON/PIP-18%3A+Introduce+clone+Action+and+Procedure
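
To make the routing rules in the stages above concrete, here is a minimal plain-Python sketch. It is not Flink or Paimon code: `pick_tables`, `manifest_action`, and `shuffle_by_file_name` are hypothetical names, the catalog is modeled as a plain dict, and CRC32 modulo the parallelism is only an assumed stand-in for whatever partitioning the real job would use.

```python
from collections import defaultdict
from zlib import crc32


def pick_tables(catalog, database=None, table=None):
    """Stage 1 (sketch): choose the tables to clone.

    `catalog` is a hypothetical {database: [table, ...]} mapping.
    No database given  -> every table of every database.
    No table given     -> every table of the given database.
    """
    if database is None:
        return [(db, t) for db, tables in sorted(catalog.items()) for t in tables]
    tables = catalog.get(database, [])
    if table is None:
        return [(database, t) for t in tables]
    return [(database, table)] if table in tables else []


def manifest_action(uses_external_path):
    """Stage 4 (sketch): rewrite when an external path is involved, else copy."""
    return "rewrite" if uses_external_path else "copy"


def shuffle_by_file_name(file_names, parallelism):
    """Shuffle before stage 5 (sketch): route each file name to one subtask.

    A stable hash keeps the same file name on the same subtask; CRC32 is an
    assumption made for this illustration only.
    """
    buckets = defaultdict(list)
    for name in file_names:
        subtask = crc32(name.encode("utf-8")) % parallelism
        buckets[subtask].append(name)
    return dict(buckets)


if __name__ == "__main__":
    catalog = {"db1": ["t1", "t2"], "db2": ["t3"]}
    print(pick_tables(catalog))                # all tables of all databases
    print(pick_tables(catalog, "db1"))         # all tables of db1
    print(manifest_action(uses_external_path=True))
    print(shuffle_by_file_name(["data-0.orc", "data-1.orc", "changelog-0.orc"], 2))
```

Keying the shuffle on the file name means a given file is always handled by a single copy task, which is the property the data-file shuffle in the plan relies on.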
