neuyilan commented on PR #4844: URL: https://github.com/apache/paimon/pull/4844#issuecomment-2579130192
Hi @JingsongLi, thanks again for the advice. I have refactored it into the following Flink batch job; please review it again. Thanks.

1. The first stage picks the tables that need to be cloned. If the database parameter is not passed, all tables of all databases are cloned. If the table parameter is not passed, all tables of the database are cloned. (Not changed, the same as the original design.)
2. The second stage just picks the schema files and copies them to the target path. The schema files comprise Snapshot, Schema, ManifestList and IndexFile.
3. The third stage just picks the manifest files, in single parallelism.
4. The fourth stage mainly copies or rewrites the manifest files, in distributed parallelism. If a manifest file is on an external path, it is rewritten; otherwise, it is copied.
5. The fifth stage picks all the data files, in single parallelism. (Data files comprise DataFile and ChangeLog.)
6. Shuffle the data files by file name.
7. The sixth stage copies the data files, in distributed parallelism.
8. Shuffle by the target table name to the next stage.
9. The seventh stage recreates the snapshot hint file. (Not changed, the same as the original design.)
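The selection rule of the first stage, the copy-or-rewrite decision of the fourth stage, and the shuffle by file name can be sketched in plain Java. This is only an illustrative sketch, not the code in the PR: the class `ClonePlanSketch`, its method names, and the in-memory `Map` standing in for the catalog are all hypothetical, and the hash-modulo routing is a stand-in for whatever partitioning Flink applies when the job shuffles by key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the stage logic described above; not Paimon or Flink API.
final class ClonePlanSketch {

    // Stage 1: pick "database.table" identifiers to clone.
    // `catalog` maps each database name to its table names (an assumed stand-in
    // for a real catalog lookup).
    static List<String> pickTables(
            Map<String, List<String>> catalog, String database, String table) {
        List<String> picked = new ArrayList<>();
        if (database == null) {
            // No database parameter: all tables of all databases.
            // Assumption: a table parameter without a database is ignored here.
            for (Map.Entry<String, List<String>> e : new TreeMap<>(catalog).entrySet()) {
                for (String t : e.getValue()) {
                    picked.add(e.getKey() + "." + t);
                }
            }
            return picked;
        }
        List<String> tables = catalog.get(database);
        if (tables == null) {
            throw new IllegalArgumentException("Unknown database: " + database);
        }
        if (table == null) {
            // No table parameter: all tables of the given database.
            for (String t : tables) {
                picked.add(database + "." + t);
            }
        } else {
            if (!tables.contains(table)) {
                throw new IllegalArgumentException("Unknown table: " + database + "." + table);
            }
            picked.add(database + "." + table);
        }
        return picked;
    }

    // Stage 4: a manifest on an external path is rewritten, otherwise copied.
    static String manifestAction(boolean onExternalPath) {
        return onExternalPath ? "rewrite" : "copy";
    }

    // Shuffle by file name: the same name always maps to the same subtask,
    // so a given file is copied by exactly one parallel copier.
    static int subtaskFor(String fileName, int parallelism) {
        return Math.floorMod(fileName.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        Map<String, List<String>> catalog = new TreeMap<>();
        catalog.put("db1", List.of("orders", "users"));
        catalog.put("db2", List.of("events"));
        System.out.println(pickTables(catalog, null, null));
        System.out.println(pickTables(catalog, "db1", null));
        System.out.println(manifestAction(true));
        System.out.println(subtaskFor("data-0001.orc", 4));
    }
}
```

Routing by file name keeps the copy stage deterministic: even if the same file is emitted more than once by the single-parallelism picker, every occurrence reaches the same copier subtask.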
