luchunliang opened a new issue, #10022: URL: https://github.com/apache/inlong/issues/10022
### Describe the proposal

## Motivation

Currently, InLong data stream is responsible for the access and distribution of full-scale data, and the data synchronization supports a certain level of data filtering and format conversion capabilities. By adding data filtering, format conversion, and enriching data content capabilities in the access and distribution links of InLong data stream, the data quality distributed to downstream systems can be improved.

In addition, besides InLong's existing real-time integration and offline integration, further improvement of the final data quality can be achieved by performing internal secondary integration on InLong data stream.

- In order to enrich the types of data protocols accessed by InLong and improve the effectiveness of ETL extraction, it is necessary to support the Transform capability of data sources.
- In order to improve the processing performance of data synchronization and reduce costs, data synchronization jobs need to support multiple synchronization tasks within a single job.
- In order to improve the user experience of defining data synchronization task conversion logic, the interface provides pre-conversion operations for the original data, verifying the correctness of the conversion logic configuration.

## Solution

The offline synchronization feature of the InLong dataset integration provides sources and sinks for processing data, corresponding to data sources and destinations, and combines with the scheduling system to synchronize full or incremental data from the data source to the data target.

InLong supports scheduling offline synchronization tasks by setting specific trigger times (including year, month, day, hour, and minute) through the scheduling system. Offline synchronization tasks are created by the Manager (including scheduling information), and the specific data synchronization logic is implemented through the InLong Sort module.
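To make the trigger-time idea concrete, here is a minimal sketch of how scheduling information for one offline task could be modeled and expanded into trigger times. It is not InLong code: `ScheduleInfo`, `trigger_times`, `incremental_window`, and all field names are hypothetical, and the assumption that an incremental run covers exactly one interval before its trigger time is ours, not something this proposal specifies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterator, Tuple


@dataclass(frozen=True)
class ScheduleInfo:
    """Hypothetical scheduling record attached to one offline sync task."""
    task_id: str
    start_time: datetime   # first trigger time
    end_time: datetime     # no instance is generated after this time
    interval: timedelta    # period between two scheduling instances


def trigger_times(info: ScheduleInfo) -> Iterator[datetime]:
    """Yield every trigger time in [start_time, end_time], one per period."""
    if info.interval <= timedelta(0):
        raise ValueError("interval must be positive")
    current = info.start_time
    while current <= info.end_time:
        yield current
        current += info.interval


def incremental_window(info: ScheduleInfo, trigger: datetime) -> Tuple[datetime, datetime]:
    """Assumed [begin, end) data range for an incremental run at `trigger`."""
    return trigger - info.interval, trigger


if __name__ == "__main__":
    info = ScheduleInfo(
        task_id="demo-task",
        start_time=datetime(2024, 4, 1, 0, 0),
        end_time=datetime(2024, 4, 1, 3, 0),
        interval=timedelta(hours=1),
    )
    for t in trigger_times(info):
        begin, end = incremental_window(info, t)
        print(f"instance at {t}: sync data in [{begin}, {end})")
```

In this sketch each yielded time stands for one scheduling instance, the point at which a batch job would be submitted. A periodic full synchronization would ignore the window and read the whole source on every trigger.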
### Logical Architecture

### Key Competency

- **Job Configuration**: Support Wizard Mode (configuration through page wizard) and OpenAPI mode.
- **Scheduling Configuration**: Support Wizard Mode (configuration through page wizard) and OpenAPI mode.
- **Job Type**: Support Periodic Incremental Synchronization and Periodic Full Synchronization.
- **Scheduling**: Built-in simple periodic scheduling capability; complex capabilities such as task dependencies are supported by third-party scheduling systems.
- **Data Source**: RDBMS, Message Queue, and Big Data storage (Hive, StarRocks, Iceberg, etc.)
- **Data Sink**: RDBMS, Message Queue, and Big Data storage (Hive, StarRocks, Iceberg, etc.)
- **Compute Engine**: Flink
- **Offline Job Operation and Maintenance**: Job start, stop, and running status monitoring
- **Special Handling**: Dirty Data Processing Capability

### Data Flow Architecture

1. The user creates an offline synchronization task.
2. The Manager saves task information and scheduling information in the DB.
3. After task approval, the offline synchronization task information is encapsulated.
4. The scheduling information is registered with the scheduling system; InLong has a built-in simple scheduling solution (Quartz), while complete scheduling capabilities rely on third-party scheduling systems (DolphinScheduler, US, etc.).
5. The scheduling system regularly generates scheduling instances.
6. For the initial run, the Manager constructs a Flink batch job.
7. The Flink batch job is submitted to the Flink cluster.

## Task list

### new dev branch

Since this is a big feature for InLong, create a new branch for development; after development and testing are completed, merge it back to master.
- [x] create new dev branch https://github.com/apache/inlong/tree/dev-offline-sync

### Manager

Offline Synchronization Task Management: Definition and Management of Offline Synchronization Tasks

- [x] #9781
- [x] #9813
- [ ] Page wizard mode
- [x] OpenAPI mode (already covered by existing OpenAPI)
- [x] #9862

Scheduling Management: Scheduling task definition, scheduling instance definition, scheduling task management (CRUD)

- [ ] Definition of scheduling information, corresponding to each offline task
- [ ] Scheduling information management
- [ ] Page wizard mode
- [ ] OpenAPI mode
- [ ] Support for periodic scheduling capability
- [ ] Scheduling instance definition
- [ ] Scheduling interface abstraction
- [ ] Plugin-based scheduling framework support
- [ ] Built-in scheduling capability support (based on Quartz)
- [ ] DolphinScheduler, US, etc.

Offline Task Submission

- [ ] Timing of Flink task submission determined by the scheduling system; submit Flink task when generating scheduling instance

Offline Task Operation and Maintenance

- [ ] Start (task submission), stop
- [ ] Retrieve running status
- [ ] Task logs, exceptions

### Sort

Flink Task Encapsulation: Add support for Flink environment in batch mode

Flink Batch Capability Support

- [ ] Support Flink 1.18, upgrade Flink dependencies
- [ ] Support Flink 1.18 connectors, connectors support batch mode operation

### InLong Component

Other for not specified component

### Are you willing to submit PR?

- [X] Yes, I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

### Task list

Waiting

### InLong Component

InLong Agent

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@inlong.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org