[I] [Umbrella] InLong Transform feature [inlong]

via GitHub Sat, 20 Apr 2024 02:01:54 -0700


luchunliang opened a new issue, #10022:
URL: https://github.com/apache/inlong/issues/10022


   ### Describe the proposal
   
   ## Motivation 
   Currently, the InLong data stream handles the access and distribution of 
full-scale data, and data synchronization supports a degree of data filtering 
and format conversion. Adding data filtering, format conversion, and data 
enrichment to the access and distribution links of the InLong data stream 
improves the quality of the data delivered to downstream systems. Beyond 
InLong's existing real-time and offline integration, a secondary integration 
pass inside the InLong data stream can improve the final data quality further.
   
   To broaden the data protocols that InLong can access and to make ETL 
extraction more effective, data sources need to support the Transform 
capability.
   
   To improve the processing performance of data synchronization and reduce 
costs, a data synchronization job needs to support multiple synchronization 
tasks within a single job.
   
   To improve the user experience of defining the conversion logic of a data 
synchronization task, the interface provides a pre-conversion operation on the 
original data, so that users can verify that the conversion logic is configured 
correctly.
   
   ## Solution
   The offline synchronization feature of InLong data integration provides 
sources and sinks, which correspond to data sources and destinations, and works 
with the scheduling system to synchronize full or incremental data from the 
data source to the data destination.
   
   InLong supports scheduling offline synchronization tasks through the 
scheduling system by setting specific trigger times (year, month, day, hour, 
and minute).
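
To make the trigger-time idea concrete, here is a minimal sketch in Python. It is not InLong code: `TriggerSpec` and `trigger_times` are hypothetical names, and treating an unset field as "any value" is an assumption made for illustration. It only shows how a year/month/day/hour/minute specification could be matched against candidate times.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class TriggerSpec:
    """Hypothetical trigger definition; a field left as None matches any value."""
    year: Optional[int] = None
    month: Optional[int] = None
    day: Optional[int] = None
    hour: Optional[int] = None
    minute: Optional[int] = None

    def matches(self, t: datetime) -> bool:
        pairs = [
            (self.year, t.year), (self.month, t.month), (self.day, t.day),
            (self.hour, t.hour), (self.minute, t.minute),
        ]
        return all(want is None or want == got for want, got in pairs)


def trigger_times(spec: TriggerSpec, start: datetime, end: datetime) -> List[datetime]:
    """Return every whole minute in [start, end) that the spec matches."""
    t = start.replace(second=0, microsecond=0)
    if t < start:
        # start was not on a minute boundary; move to the next whole minute
        t += timedelta(minutes=1)
    out = []
    while t < end:
        if spec.matches(t):
            out.append(t)
        t += timedelta(minutes=1)
    return out


# Usage: a task that should run every day at 02:00.
daily_2am = TriggerSpec(hour=2, minute=0)
runs = trigger_times(daily_2am, datetime(2024, 4, 20), datetime(2024, 4, 23))
print([r.isoformat() for r in runs])
```

For the three-day window above this yields one run per day at 02:00. A real scheduler would compute the next fire time directly rather than scanning minute by minute; the scan is used here only to keep the sketch short.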
   
   Offline synchronization tasks are created by the Manager (including 
scheduling information), and the specific data synchronization logic is 
implemented through the InLong Sort module.
   
   ### Logical Architecture
   
![image](https://github.com/apache/inlong/assets/48062889/319469ac-c82b-4dfb-b858-5917a1bb6a89)
   
   ### Key Capabilities
   **Job Configuration**: Supports wizard mode (configuration through a page 
wizard) and OpenAPI mode.
   
   **Scheduling Configuration**: Supports wizard mode (configuration through a 
page wizard) and OpenAPI mode.
   
   **Job Type**: Supports periodic incremental synchronization and periodic 
full synchronization.
   
   **Scheduling**: Built-in simple periodic scheduling capability; complex 
capabilities such as task dependencies are supported by third-party scheduling 
systems.
   
   **Data Source**: RDBMS, message queues, and big data storage (Hive, 
StarRocks, Iceberg, etc.)
   
   **Data Sink**: RDBMS, message queues, and big data storage (Hive, 
StarRocks, Iceberg, etc.)
   
   **Compute Engine**: Flink
   
   **Offline Job Operation and Maintenance**: Job start, stop, and running 
status monitoring
   
   **Special Handling**: Dirty data processing capability
   
   ### Data Flow Architecture
   
![image](https://github.com/apache/inlong/assets/48062889/a0c83ac0-8011-4542-b311-ebe2d22dd141)
   1. The user creates an offline synchronization task.
   2. The manager saves task information and scheduling information in the DB.
   3. After task approval, the offline synchronization task information is 
encapsulated.
   4. Register scheduling information with the scheduling system; InLong has a 
built-in simple scheduling solution (Quartz), while complete scheduling 
capabilities rely on third-party scheduling systems (DolphinScheduler, US, 
etc.).
   5. The scheduling system regularly generates scheduling instances.
   6. For the initial run, the manager constructs a Flink batch job.
   7. Submit the Flink batch job to the Flink cluster.
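
The seven steps above can be sketched as a small in-memory model. This is an illustration only: `OfflineTask`, `InMemoryFlow`, and every method name are hypothetical, the DB, scheduling system, and Flink cluster are replaced by plain Python containers, and reusing the job definition on later runs is an assumption that the proposal does not spell out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OfflineTask:
    task_id: str
    schedule: str          # opaque scheduling information, e.g. "daily 02:00"
    approved: bool = False


@dataclass
class InMemoryFlow:
    db: Dict[str, OfflineTask] = field(default_factory=dict)     # stands in for the DB
    registered: List[str] = field(default_factory=list)          # stands in for the scheduler
    built_jobs: Dict[str, dict] = field(default_factory=dict)    # constructed batch jobs
    submitted: List[str] = field(default_factory=list)           # stands in for the Flink cluster

    def create_task(self, task_id: str, schedule: str) -> None:
        # Steps 1-2: the user creates the task; task and scheduling info are saved.
        self.db[task_id] = OfflineTask(task_id, schedule)

    def approve(self, task_id: str) -> None:
        # Steps 3-4: after approval, register the scheduling info with the scheduler.
        task = self.db[task_id]
        task.approved = True
        self.registered.append(task_id)

    def on_schedule_instance(self, task_id: str, instance_time: str) -> None:
        # Step 5 is the scheduler calling this method for each generated instance.
        if task_id not in self.registered:
            raise ValueError(f"task {task_id!r} is not registered for scheduling")
        if task_id not in self.built_jobs:
            # Step 6: construct the batch job on the initial run only (assumed reuse later).
            self.built_jobs[task_id] = {"task": task_id, "mode": "batch"}
        # Step 7: submit the batch job for this instance.
        self.submitted.append(f"{task_id}@{instance_time}")


flow = InMemoryFlow()
flow.create_task("orders_sync", "daily 02:00")
flow.approve("orders_sync")
flow.on_schedule_instance("orders_sync", "2024-04-20T02:00")
flow.on_schedule_instance("orders_sync", "2024-04-21T02:00")
print(flow.submitted)
```

The point of the sketch is the ordering: nothing is submitted before approval and registration, and job construction happens once while submission happens per scheduling instance.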
   
   
   
   ## Task list
   
   ### new dev branch
   Since this is a big feature for InLong, create a new branch for 
development; after development and testing are completed, merge it back to 
master.
   - [x] create new dev branch  
https://github.com/apache/inlong/tree/dev-offline-sync
   ### Manager
   Offline Synchronization Task Management: Definition and Management of 
Offline Synchronization Tasks
   
   - [x]  #9781 
   - [x]  #9813 
     - [ ] Page wizard mode
     - [x]  OpenAPI mode (already covered by the existing OpenAPI)
   - [x] #9862  
   
   Scheduling Management: Scheduling task definition, scheduling instance 
definition, scheduling task management (CRUD)
   
   - [ ]  Definition of scheduling information, corresponding to each offline 
task
   - [ ] Scheduling information management
     - [ ]  Page wizard mode
     - [ ]  OpenAPI mode
   - [ ]  Support for periodic scheduling capability
     - [ ] Scheduling instance definition
     - [ ]  Scheduling interface abstraction
     - [ ]  Plugin-based scheduling framework support
       - [ ]  Built-in scheduling capability support (based on Quartz)
       - [ ]  DolphinScheduler, US, etc.
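
One possible shape for the scheduling interface abstraction and the plugin-based framework is sketched below in Python. Everything here is hypothetical: `ScheduleEngine`, `ENGINES`, `engine`, `BuiltinEngine`, and `create_engine` are illustrative names, not InLong classes, and the built-in engine is a plain in-memory stub rather than a Quartz integration.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Type


class ScheduleEngine(ABC):
    """Hypothetical scheduling interface that every engine plugin implements."""

    @abstractmethod
    def register(self, task_id: str, schedule: str) -> None:
        """Register scheduling information for an offline task."""

    @abstractmethod
    def unregister(self, task_id: str) -> None:
        """Remove scheduling information for an offline task."""


# Plugin registry: engine name -> implementation class.
ENGINES: Dict[str, Type[ScheduleEngine]] = {}


def engine(name: str) -> Callable[[Type[ScheduleEngine]], Type[ScheduleEngine]]:
    """Class decorator that adds an implementation to the registry."""
    def wrap(cls: Type[ScheduleEngine]) -> Type[ScheduleEngine]:
        ENGINES[name] = cls
        return cls
    return wrap


@engine("builtin")
class BuiltinEngine(ScheduleEngine):
    """In-memory stand-in for a built-in engine; it only records registrations."""

    def __init__(self) -> None:
        self.jobs: Dict[str, str] = {}

    def register(self, task_id: str, schedule: str) -> None:
        self.jobs[task_id] = schedule

    def unregister(self, task_id: str) -> None:
        self.jobs.pop(task_id, None)


def create_engine(name: str) -> ScheduleEngine:
    """Look up an engine by its configured name and instantiate it."""
    try:
        return ENGINES[name]()
    except KeyError:
        raise ValueError(f"unknown scheduling engine: {name!r}") from None


scheduler = create_engine("builtin")
scheduler.register("orders_sync", "daily 02:00")
print(scheduler.jobs)
```

With this layout, support for a third-party scheduling system would be one more class registered under its own name, and the Manager would depend only on `ScheduleEngine`.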
   
   Offline Task Submission
   
   - [ ] Timing of Flink task submission determined by the scheduling system; 
submit Flink task when generating scheduling instance
   
   Offline Task Operation and Maintenance
   
   - [ ]  Start (task submission), stop
   - [ ]  Retrieve running status
   - [ ]  Task logs, exceptions
   
   ### Sort
   Flink Task Encapsulation: Add support for Flink environment in batch mode
   
   Flink Batch Capability Support
   
   - [ ]  Support Flink 1.18, upgrade Flink dependencies
   - [ ]  Support Flink 1.18 connectors, connectors support batch mode operation
   
   ### InLong Component
   
   Other for not specified component
   
   ### Are you willing to submit PR?
   
   - [X] Yes, I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   
   
   ### Task list
   
   Waiting
   
   ### InLong Component
   
   InLong Agent
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
