NamsB7 opened a new pull request, #4121:
URL: https://github.com/apache/gobblin/pull/4121

   Dear Gobblin maintainers,
   
   Please accept this PR. I understand that it will not be reviewed until I 
have checked off all the steps below!
   
   
   ### JIRA
   - [x] My PR addresses the following [Gobblin 
JIRA](https://issues.apache.org/jira/browse/GOBBLIN-2211) issues and references 
them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
       - https://issues.apache.org/jira/browse/GOBBLIN-2211
   
   
   ### Description
   - [x] Here are some details about my PR, including screenshots (if 
applicable):
   - Introduces an error classification system for Gobblin jobs, enabling 
automatic priority-based categorization of job failures based on configurable 
error pattern descriptions.
   - Modified job execution flow to classify errors on job failure
   - Integrated with existing JobIssueEventHandler for error logging
   - No breaking changes to existing functionality
   - Classification only runs on job failures when enabled
   
   #### Why These Changes
   - Provides intelligent error analysis by matching failures against
     known patterns
   - Consolidates multiple errors into a single, prioritized final error with 
enriched context
   - Suggests probable root causes rather than a definitive diagnosis
   - Appends pattern-matched summaries to enhance error visibility
   - Reduces time spent manually correlating similar failures
   - Builds organizational knowledge of common failure patterns
   
   ##### Key features
   - Pattern-based Classification: Uses regex patterns to match and categorize 
errors
   - Priority-based Selection: Returns the highest priority error when multiple 
patterns match
   - Extensible Storage: Pluggable storage backend (in-memory, MySQL, etc.)
   - Performance Optimized: Early stopping mechanism to avoid unnecessary 
pattern matching
   - Configurable: Dynamic SQL column sizes and table names
   
   - This change introduces two major new components:
       - ErrorClassifier: New class defining the contract for error 
classification
       - ErrorPatternStore: New interface for managing error pattern 
persistence 
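
The pattern matching, priority-based selection, and early stopping described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: every type and method name (`ErrorPattern`, `PatternStore`, `InMemoryPatternStore`, `classify`) is hypothetical, and it assumes that a lower priority number means a higher priority and that matching is case-insensitive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Illustrative sketch only: all names here are hypothetical, not this PR's API.
class ErrorClassificationSketch {

  /** A regex with its category and priority. Assumption: lower number = higher priority. */
  static final class ErrorPattern {
    final Pattern regex;
    final String category;
    final int priority;

    ErrorPattern(String regex, String category, int priority) {
      this.regex = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
      this.category = category;
      this.priority = priority;
    }
  }

  /** Stand-in for a pluggable pattern store (in-memory, MySQL, ...). */
  interface PatternStore {
    List<ErrorPattern> getPatterns();
  }

  static final class InMemoryPatternStore implements PatternStore {
    private final List<ErrorPattern> patterns = new ArrayList<>();

    InMemoryPatternStore add(String regex, String category, int priority) {
      patterns.add(new ErrorPattern(regex, category, priority));
      return this;
    }

    @Override
    public List<ErrorPattern> getPatterns() {
      return patterns;
    }
  }

  /** Returns the highest-priority pattern matching any of the first maxErrorsToProcess errors. */
  static Optional<ErrorPattern> classify(List<String> errors, PatternStore store, int maxErrorsToProcess) {
    if (errors == null || errors.isEmpty()) {
      return Optional.empty();
    }
    List<ErrorPattern> patterns = store.getPatterns();
    // Best priority any stored pattern can have; reaching it allows early stopping.
    int topPriority = Integer.MAX_VALUE;
    for (ErrorPattern p : patterns) {
      topPriority = Math.min(topPriority, p.priority);
    }

    ErrorPattern best = null;
    int limit = Math.min(errors.size(), maxErrorsToProcess);
    for (int i = 0; i < limit; i++) {
      String message = errors.get(i);
      if (message == null) {
        continue;
      }
      for (ErrorPattern p : patterns) {
        // Only run the regex when the pattern could improve on the current best.
        boolean couldImprove = best == null || p.priority < best.priority;
        if (couldImprove && p.regex.matcher(message).find()) {
          best = p;
          if (best.priority == topPriority) {
            return Optional.of(best); // early stop: nothing can outrank this match
          }
        }
      }
    }
    return Optional.ofNullable(best);
  }

  public static void main(String[] args) {
    PatternStore store = new InMemoryPatternStore()
        .add("OutOfMemoryError", "SYSTEM", 1)
        .add("permission denied", "USER", 2)
        .add("timeout", "INFRA", 3);
    List<String> errors = List.of("Connection timeout after 30s", "Permission denied for /data");
    classify(errors, store, 10).ifPresent(p -> System.out.println(p.category));
  }
}
```

Skipping the regex for patterns that cannot outrank the current best keeps the inner loop cheap, and the store interface leaves room for a database-backed implementation without changing the classification logic.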
   
   ##### Configurations
   - `error.regex.db.table.key`: MySQL table for error patterns
   - `error.categories.db.table.key`: MySQL table for error categories
   - `error.regex.max.varchar.size`: Configurable VARCHAR size for pattern 
storage
   - `error.category.max.varchar.size`: Configurable VARCHAR size for category 
names
   
   ###### Service Configurations
   - `errorPatternStore.class`: Configures the pattern store implementation
   - `errorClassification.enabled`: Toggles classification. Disabled by 
default.
   - `errorClassification.maxErrorsInFinal`: Caps final error count
   - `errorClassification.maxErrorsToProcess`: Limits errors processed for 
performance
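
As a rough illustration of how the service settings above might be read, the sketch below loads them from a `java.util.Properties` object. The key names are copied from the list above as written; the class name, field names, and the numeric defaults are assumptions made for the example, since the description only states that classification is disabled by default.

```java
import java.util.Properties;

// Illustrative sketch only: class name and numeric defaults are assumptions.
class ErrorClassificationConfigSketch {
  static final String ENABLED_KEY = "errorClassification.enabled";
  static final String MAX_ERRORS_IN_FINAL_KEY = "errorClassification.maxErrorsInFinal";
  static final String MAX_ERRORS_TO_PROCESS_KEY = "errorClassification.maxErrorsToProcess";

  // Placeholder defaults chosen for illustration; the PR description does not state them.
  static final int ASSUMED_MAX_ERRORS_IN_FINAL = 10;
  static final int ASSUMED_MAX_ERRORS_TO_PROCESS = 1000;

  final boolean enabled;
  final int maxErrorsInFinal;
  final int maxErrorsToProcess;

  ErrorClassificationConfigSketch(boolean enabled, int maxErrorsInFinal, int maxErrorsToProcess) {
    this.enabled = enabled;
    this.maxErrorsInFinal = maxErrorsInFinal;
    this.maxErrorsToProcess = maxErrorsToProcess;
  }

  static ErrorClassificationConfigSketch fromProperties(Properties props) {
    // Classification stays off unless explicitly enabled, matching the stated default.
    boolean enabled = Boolean.parseBoolean(props.getProperty(ENABLED_KEY, "false").trim());
    int maxInFinal = readInt(props, MAX_ERRORS_IN_FINAL_KEY, ASSUMED_MAX_ERRORS_IN_FINAL);
    int maxToProcess = readInt(props, MAX_ERRORS_TO_PROCESS_KEY, ASSUMED_MAX_ERRORS_TO_PROCESS);
    return new ErrorClassificationConfigSketch(enabled, maxInFinal, maxToProcess);
  }

  private static int readInt(Properties props, String key, int fallback) {
    String raw = props.getProperty(key);
    return raw == null ? fallback : Integer.parseInt(raw.trim());
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(ENABLED_KEY, "true");
    props.setProperty(MAX_ERRORS_TO_PROCESS_KEY, "50");
    ErrorClassificationConfigSketch config = fromProperties(props);
    System.out.println(config.enabled + " " + config.maxErrorsInFinal + " " + config.maxErrorsToProcess);
  }
}
```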
   
   ### Tests
   - [x] My PR adds the following unit tests:
     - **ErrorClassifierTest**: Simple unit test suite checking, for a given 
list of issues with severity ERROR:
       - Pattern matching and classification logic
       - Priority-based error selection when multiple patterns match
       - Early stopping optimization
       - Edge cases (null/empty errors, no matches)
       - Integration with InMemory ErrorPatternStore implementation
   
   ### Commits
   - [x] My commits all reference JIRA issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
       1. Subject is separated from body by a blank line
       2. Subject is limited to 50 characters
       3. Subject does not end with a period
       4. Subject uses the imperative mood ("add", not "adding")
       5. Body wraps at 72 characters
       6. Body explains "what" and "why", not "how"
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.