[I] [Feature][Connectors] Abnormal Data Logging [seatunnel]

via GitHub Sat, 09 Nov 2024 18:32:56 -0800


Ivan-gfan opened a new issue, #8005:
URL: https://github.com/apache/seatunnel/issues/8005


   ### Search before asking
   
   - [X] I had searched in the 
[feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 and found no similar feature requirement.
   
   
   ### Description
   
   # Description:
   > Currently, there are no metrics for tracking abnormal data records, nor is 
there an option to ignore exceptions and continue execution. Regardless of 
whether JDBC or other data sources are used, any error encountered during 
insertion will terminate the application, which is not user-friendly.
   
   Suggested Improvements:
   
   ## 1. Abnormal Data Metrics:
   The final metrics should include not only the read and write counts but also 
the count of abnormal data. The sum of abnormal data and successful write 
counts should equal the total read count.
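
As a minimal sketch of this invariant (the class and field names below are hypothetical and are not existing SeaTunnel APIs), the three counters and their consistency check could look like this:

```python
from dataclasses import dataclass


@dataclass
class SyncMetrics:
    """Hypothetical counters; not an existing SeaTunnel class."""
    read_count: int = 0
    write_count: int = 0
    abnormal_count: int = 0

    def is_balanced(self) -> bool:
        # Every record read must end up either written or counted as abnormal.
        return self.read_count == self.write_count + self.abnormal_count


metrics = SyncMetrics(read_count=1000, write_count=997, abnormal_count=3)
assert metrics.is_balanced()
```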
   
   ## 2. Detailed Abnormal Record Entity:
   Introduce a domain entity to record detailed information about abnormal 
records. This entity should include:
   
   - The identifier of the erroneous column.
   - The name of the column.
   - The erroneous data content.
   - The reason for the error.
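
One possible shape for such an entity, shown only as an illustration: the name `AbnormalRecord`, its fields, and the choice of a positional index as the column identifier are all assumptions, not part of SeaTunnel.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class AbnormalRecord:
    """Hypothetical entity describing one rejected value; names are illustrative."""
    column_index: int  # identifier of the erroneous column (assumed positional)
    column_name: str   # name of the column
    value: Any         # the erroneous data content
    reason: str        # why the write failed


record = AbnormalRecord(
    column_index=2,
    column_name="age",
    value="abc",
    reason="cannot cast 'abc' to INT",
)
```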
   
   ## 3. Batch Submission Handling:
   Some connectors may use batch submission to improve performance, relying on 
the transaction management of the target data source (e.g., the `batch_size` 
parameter in the JDBC connector). Users must trade throughput against 
record-level granularity in error tracking.
   
   - If a batch contains an erroneous record, the entire batch will typically 
fail. As a result, the error count will accumulate in multiples of `batch_size`.
   
   - For precise error tracking, users would need to set `batch_size` to 1, but 
this compromises performance.
   
   - Conversely, for high-performance batch submissions, error tracking becomes 
less accurate.
   
   This trade-off needs to be managed by the user based on their specific use 
case, but the system should provide the necessary functionality.
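
The effect of `batch_size` on error granularity can be illustrated with a small simulation. This is a sketch under stated assumptions: `write_in_batches` and `fake_sink` are hypothetical stand-ins for a transactional sink, not the JDBC connector's actual code.

```python
from typing import Callable, Iterable, List, Tuple


def write_in_batches(
    rows: Iterable[dict],
    batch_size: int,
    write_batch: Callable[[List[dict]], None],
) -> Tuple[int, int]:
    """Return (written, failed). A failing batch counts all of its rows as failed."""
    written = failed = 0
    batch: List[dict] = []

    def flush() -> None:
        nonlocal written, failed
        if not batch:
            return
        try:
            write_batch(batch)
            written += len(batch)
        except Exception:
            # The sink rejected the whole transaction, so we cannot tell
            # which row was bad: every row in the batch is counted as failed.
            failed += len(batch)
        batch.clear()

    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            flush()
    flush()  # write the final partial batch, if any
    return written, failed


def fake_sink(batch: List[dict]) -> None:
    # Stand-in for a transactional sink: one bad row fails the whole batch.
    if any(row["age"] is None for row in batch):
        raise ValueError("NULL in non-nullable column")


rows = [{"age": i} for i in range(10)]
rows[4]["age"] = None  # a single bad record

print(write_in_batches(rows, batch_size=5, write_batch=fake_sink))  # (5, 5)
print(write_in_batches(rows, batch_size=1, write_batch=fake_sink))  # (9, 1)
```

With one bad row out of ten, a batch size of 5 reports five failures, while a batch size of 1 reports exactly one, at the cost of ten separate writes.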
   
   ## 4. Planned Total Record Count in Metrics:
   It would be beneficial to include the total planned record count in the 
metrics (e.g., the result of `SELECT COUNT(*) FROM source`).
   
   - This would enable the implementation of a progress bar when using batch 
processing.
   
   - Currently, the metrics show only the cumulative read and write counts so 
far; they do not include the total planned count for the entire task.
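
A rough sketch of how a planned total could drive a progress bar. The function names are hypothetical, and counting abnormal records as "handled" is an assumption made for this example.

```python
from typing import Optional


def progress_fraction(write_count: int, abnormal_count: int,
                      planned_total: Optional[int]) -> Optional[float]:
    """Fraction of the planned records already handled, or None if unknown."""
    if not planned_total:  # unknown or empty source: no meaningful progress
        return None
    handled = write_count + abnormal_count
    # Clamp at 1.0: the source may grow after the initial count was taken.
    return min(handled / planned_total, 1.0)


def render_bar(fraction: float, width: int = 20) -> str:
    """Render a fixed-width text progress bar for a fraction in [0, 1]."""
    filled = int(fraction * width)
    return "[" + "#" * filled + "." * (width - filled) + f"] {fraction:.0%}"


print(render_bar(progress_fraction(450, 50, 2000)))  # [#####...............] 25%
```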
   
   ### Usage Scenario
   
   # 1. Precise Error Row Counting and Detailed Error Information
   
   - The system should be able to accurately count the number of error rows.
   - For each error, detailed information should be recorded, including:
     - The specific row that caused the error.
     - The erroneous column identifier and name.
     - The erroneous data content.
     - The reason for the error.
   
   # 2. Incremental Synchronization
   
   - The solution must support incremental data synchronization.
   - Errors encountered during synchronization should not halt the entire 
process.
   
   # 3. User Display
   
   - Users should be able to view a summary of the synchronization process, 
including error statistics and detailed error records.
   
   # 4. Key Pain Points:
   
   - Task Termination on Error: During large-scale data synchronization, a 
single error can cause the task to terminate abruptly.
     - This is especially frustrating when earlier successful transactions have 
already been committed to the database.
     - Users are left with incomplete data and have to restart or manually 
reconcile the process.
   
   # 5. Desired Behavior:
   
   - The task should not terminate immediately upon encountering an error.
   - Instead, errors should be logged, and the synchronization process should 
continue.
   - At the end of the task, a comprehensive report should be available for 
users, showing:
     - Total records processed.
     - Successful writes.
     - Errors, including their details.
     - Metrics for read, write, and error counts.
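
The desired behavior can be sketched as a loop that records failures and keeps going, then returns a summary. This is an illustration only: `sync`, `strict_writer`, and the report layout are hypothetical, and a real connector would write in batches rather than row by row.

```python
from typing import Callable, Iterable, List, Tuple


def sync(rows: Iterable[dict], write_row: Callable[[dict], None]) -> dict:
    """Write rows one by one; log failures and keep going instead of aborting."""
    read = written = 0
    errors: List[Tuple[dict, str]] = []  # (offending row, reason)

    for row in rows:
        read += 1
        try:
            write_row(row)
            written += 1
        except Exception as exc:
            # Record the failure and continue with the next row.
            errors.append((row, str(exc)))

    return {
        "read": read,
        "written": written,
        "error_count": len(errors),
        "errors": errors,
    }


def strict_writer(row: dict) -> None:
    # Stand-in sink that rejects non-integer ages.
    if not isinstance(row["age"], int):
        raise TypeError(f"column 'age': expected int, got {row['age']!r}")


report = sync([{"age": 30}, {"age": "abc"}, {"age": 41}], strict_writer)
print(report["read"], report["written"], report["error_count"])  # 3 2 1
print(report["errors"][0][1])  # column 'age': expected int, got 'abc'
```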
   
   This approach would improve user experience and ensure data integrity while 
allowing users to handle errors post-synchronization.
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
