NamsB7 opened a new pull request, #4121: URL: https://github.com/apache/gobblin/pull/4121
Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

### JIRA
- [x] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN-2211) issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
    - https://issues.apache.org/jira/browse/GOBBLIN-2211

### Description
- [x] Here are some details about my PR, including screenshots (if applicable):
    - Introduces an error classification system for Gobblin jobs, enabling automatic, priority-based categorization of job failures based on configurable error pattern descriptions
    - Modifies the job execution flow to classify errors on job failure
    - Integrates with the existing JobIssueEventHandler for error logging
    - No breaking changes to existing functionality
    - Classification runs only on job failures, and only when enabled

#### Why These Changes
- Provides intelligent error analysis by matching failures against known patterns
- Consolidates multiple errors into a single, prioritized final error with enriched context
- Suggests probable root causes rather than a definitive diagnosis
- Appends pattern-matched summaries to enhance error visibility
- Reduces time spent manually correlating similar failures
- Builds organizational knowledge of common failure patterns

##### Key features
- Pattern-based Classification: Uses regex patterns to match and categorize errors
- Priority-based Selection: Returns the highest-priority error when multiple patterns match
- Extensible Storage: Pluggable storage backend (in-memory, MySQL, etc.)
- Performance Optimized: Early stopping mechanism to avoid unnecessary pattern matching
- Configurable: Dynamic SQL column sizes and table names
- This change introduces two major new components:
    - ErrorClassifier: New class defining the contract for error classification
    - ErrorPatternStore: New interface for managing error pattern persistence

##### Configurations
- `error.regex.db.table.key`: MySQL table for error patterns
- `error.categories.db.table.key`: MySQL table for error categories
- `error.regex.max.varchar.size`: Configurable VARCHAR size for pattern storage
- `error.category.max.varchar.size`: Configurable VARCHAR size for category names

###### Service Configurations
- `errorPatternStore.class`: Configures the pattern store implementation
- `errorClassification.enabled`: Toggles classification. Disabled by default
- `errorClassification.maxErrorsInFinal`: Caps the final error count
- `errorClassification.maxErrorsToProcess`: Limits the number of errors processed, for performance

### Tests
- [x] My PR adds the following unit tests:
    - **ErrorClassifierTest**: Simple unit test suite checking, for a given list of issues with severity ERROR:
        - Pattern matching and classification logic
        - Priority-based error selection when multiple patterns match
        - Early stopping optimization
        - Edge cases (null/empty errors, no matches)
        - Integration with the InMemory ErrorPatternStore implementation

### Commits
- [x] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"
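
For readers who have not opened the diff, here is a rough sketch of how the pattern-based classification, priority-based selection, and early stopping described in this PR could fit together. It is not the PR's code: the class `ErrorClassifierSketch`, the `ErrorPattern` holder, the `classify` signature, the category names, and the convention that a lower number means a higher priority are all assumptions made for illustration. The `maxErrorsToProcess` parameter only mirrors the intent of the `errorClassification.maxErrorsToProcess` setting.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

/** Illustrative sketch only; every name here is hypothetical, not the PR's actual API. */
final class ErrorClassifierSketch {

  /** A regex tied to a category and a priority (assumption: lower number = higher priority). */
  static final class ErrorPattern {
    final Pattern regex;
    final String category;
    final int priority;

    ErrorPattern(String regex, String category, int priority) {
      this.regex = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
      this.category = category;
      this.priority = priority;
    }
  }

  /** Assumed best possible priority; once matched, no later error can outrank it. */
  static final int TOP_PRIORITY = 1;

  /** Returns the highest-priority pattern that matches any of the first maxErrorsToProcess errors. */
  static Optional<ErrorPattern> classify(
      List<String> errors, List<ErrorPattern> patterns, int maxErrorsToProcess) {
    if (errors == null || patterns == null) {
      return Optional.empty();
    }
    ErrorPattern best = null;
    int limit = Math.min(errors.size(), maxErrorsToProcess);
    for (int i = 0; i < limit; i++) {
      String message = errors.get(i);
      if (message == null) {
        continue;
      }
      for (ErrorPattern p : patterns) {
        // Keep the match with the smallest priority number seen so far.
        if (p.regex.matcher(message).find() && (best == null || p.priority < best.priority)) {
          best = p;
        }
      }
      // Early stop: a top-priority match cannot be improved on.
      if (best != null && best.priority == TOP_PRIORITY) {
        break;
      }
    }
    return Optional.ofNullable(best);
  }

  public static void main(String[] args) {
    List<ErrorPattern> patterns = Arrays.asList(
        new ErrorPattern("permission denied", "USER", 1),        // hypothetical category
        new ErrorPattern("timed out", "SYSTEM_INFRA", 2));       // hypothetical category
    List<String> errors = Arrays.asList(
        "Connection timed out after 30s",
        "Permission denied: /data/output");
    Optional<ErrorPattern> result = classify(errors, patterns, 10);
    System.out.println(result.map(p -> p.category).orElse("UNKNOWN")); // prints "USER"
  }
}
```

In this sketch the second error wins because its pattern carries the better priority, even though the first error also matched. Lowering `maxErrorsToProcess` to 1 would return `SYSTEM_INFRA` instead, which shows the trade-off between bounded processing cost and classification accuracy.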