Ivan-gfan opened a new issue, #8005: URL: https://github.com/apache/seatunnel/issues/8005
### Search before asking

- [X] I have searched the [feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22) issues and found no similar feature request.

### Description

# Description:

> Currently, there are no metrics for tracking abnormal data records, nor is there an option to ignore exceptions and continue execution. Regardless of whether JDBC or another data source is used, any error encountered during insertion terminates the application, which is not user-friendly.

Suggested Improvements:

## 1. Abnormal Data Metrics:

The final metrics should include not only the read and write counts but also the count of abnormal records. The abnormal count plus the successful write count should equal the total read count.

## 2. Detailed Abnormal Record Entity:

Introduce a domain entity to record detailed information about abnormal records. This entity should include:

- The identifier of the erroneous column.
- The name of the column.
- The erroneous data content.
- The reason for the error.

## 3. Batch Submission Handling:

Some connectors use batch submission to improve performance, relying on the transaction management of the target data source (e.g., the `batch_size` parameter in the JDBC connector). Users must weigh throughput against record-level error granularity.

- If a batch contains an erroneous record, the entire batch will typically fail. As a result, the error count will accumulate in multiples of `batch_size`.
- For precise error tracking, users would need to set `batch_size` to 1, but this compromises performance.
- Conversely, for high-performance batch submissions, error tracking becomes less accurate.

This trade-off needs to be managed by the user based on their specific use case, but the system should provide the necessary functionality.

## 4. Planned Total Record Count in Metrics:

It would be beneficial to include the total planned record count in the metrics (e.g., the result of `SELECT COUNT(*) FROM source`).
- This would enable the implementation of a progress bar when using batch processing.
- Currently, the metrics only show the cumulative read and write counts at the current time but do not include the total planned count for the entire task.

### Usage Scenario

# 1. Precise Error Row Counting and Detailed Error Information

- The system should be able to accurately count the number of error rows.
- For each error, detailed information should be recorded, including:
  - The specific row that caused the error.
  - The erroneous column identifier and name.
  - The erroneous data content.
  - The reason for the error.

# 2. Incremental Synchronization

- The solution must support incremental data synchronization.
- Errors encountered during synchronization should not halt the entire process.

# 3. User Display

- Users should be able to view a summary of the synchronization process, including error statistics and detailed error records.

# 4. Key Pain Points:

- Task Termination on Error: During large-scale data synchronization, a single error can cause the task to terminate abruptly.
  - This is especially frustrating when earlier successful transactions have already been committed to the database.
  - Users are left with incomplete data and have to restart or manually reconcile the process.

# 5. Desired Behavior:

- The task should not terminate immediately upon encountering an error.
- Instead, errors should be logged, and the synchronization process should continue.
- At the end of the task, a comprehensive report should be available for users, showing:
  - Total records processed.
  - Successful writes.
  - Errors, including their details.
  - Metrics for read, write, and error counts.

This approach would improve user experience and ensure data integrity while allowing users to handle errors post-synchronization.

### Related issues

_No response_

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!
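
To make the request more concrete, here is a minimal Python sketch of the abnormal-record entity, the metrics invariant (abnormal + written = read), and the way a failed batch inflates the abnormal count. It is only an illustration under stated assumptions, not SeaTunnel code: `AbnormalRecord`, `SyncMetrics`, `RowWriteError`, `write_in_batches`, and `demo_write_batch` are all hypothetical names, and the sink is simulated by a plain function that raises on a bad value.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Sequence

Row = Dict[str, Any]


@dataclass(frozen=True)
class AbnormalRecord:
    """Hypothetical entity with the four fields listed in the description."""
    column_id: int     # identifier (here: positional index) of the erroneous column
    column_name: str   # name of the column
    content: Any       # the erroneous data content
    reason: str        # the reason for the error


class RowWriteError(Exception):
    """Hypothetical error a sink raises when one record cannot be written."""

    def __init__(self, record: AbnormalRecord):
        super().__init__(record.reason)
        self.record = record


@dataclass
class SyncMetrics:
    """Hypothetical counters: read, successful write, and abnormal counts."""
    read_count: int = 0
    write_count: int = 0
    abnormal_count: int = 0
    abnormal_records: List[AbnormalRecord] = field(default_factory=list)

    def is_consistent(self) -> bool:
        # Invariant from the request: abnormal + successful writes == total read.
        return self.write_count + self.abnormal_count == self.read_count


def write_in_batches(
    rows: Sequence[Row],
    batch_size: int,
    write_batch: Callable[[Sequence[Row]], None],
    metrics: SyncMetrics,
) -> SyncMetrics:
    """Write rows batch by batch; log failures and continue instead of aborting."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        metrics.read_count += len(batch)
        try:
            write_batch(batch)
            metrics.write_count += len(batch)
        except RowWriteError as exc:
            # Assumes the target rolls back the whole batch, so every row in it
            # counts as abnormal even though only one record was actually bad.
            metrics.abnormal_count += len(batch)
            metrics.abnormal_records.append(exc.record)
    return metrics


def demo_write_batch(batch: Sequence[Row]) -> None:
    """Simulated sink: rejects any row whose 'age' value is not an integer."""
    for row in batch:
        if not isinstance(row["age"], int):
            raise RowWriteError(
                AbnormalRecord(1, "age", row["age"], "expected an integer")
            )
```

With six rows of which one is bad, calling `write_in_batches(rows, 2, demo_write_batch, SyncMetrics())` in this sketch counts two abnormal rows (the whole failed batch), while a batch size of 1 counts exactly one. In both cases `is_consistent()` holds, which is the property the requested metrics should guarantee.
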
### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
