yaojiejia opened a new pull request, #18458:
URL: https://github.com/apache/hudi/pull/18458

   
   ### Describe the issue this Pull Request addresses
   
   Closes #15942
   
   When HoodieStreamer is configured to ingest JSON records, a single invalid 
record (e.g., plain text instead of valid JSON) causes the entire pipeline to 
crash with a `HoodieJsonToAvroConversionException`. The only existing 
workaround is configuring the full error table infrastructure, which is 
heavyweight for users who simply want to skip bad records and continue 
processing.
   
   ### Summary and Changelog
   
   Adds a new config `hoodie.streamer.source.json.skip.invalid.records` 
(default `false`) that allows HoodieStreamer to skip invalid JSON records 
instead of crashing. When enabled, bad records are logged at WARN level and 
dropped. This reuses the existing safe conversion methods (`fromJsonWithError`, 
`fromJsonToRowWithError`) that were previously only available when the error 
table was configured.
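
The skip-versus-crash behavior described above can be illustrated with a minimal, standalone Python sketch. It is not the Hudi implementation: `to_record`, `from_json_with_error`, and `convert` are hypothetical names, the second only loosely modeled on the `fromJsonWithError` method mentioned in this PR, and plain dictionaries stand in for Avro records.

```python
import json
import logging
from typing import Iterable, Iterator, Optional, Tuple

logger = logging.getLogger("skip_invalid_json_sketch")


def to_record(raw: str) -> dict:
    """Strict conversion: raises ValueError on anything that is not a JSON object."""
    parsed = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(parsed, dict):
        raise ValueError(f"expected a JSON object, got {type(parsed).__name__}")
    return parsed


def from_json_with_error(raw: str) -> Tuple[Optional[dict], Optional[str]]:
    """Safe conversion (hypothetical): returns (record, None) or (None, bad_input)."""
    try:
        return to_record(raw), None
    except ValueError:
        return None, raw


def convert(records: Iterable[str], skip_invalid: bool = False) -> Iterator[dict]:
    """Yield converted records; drop bad ones only when skip_invalid is True."""
    for raw in records:
        if not skip_invalid:
            # Default path: the first bad record aborts the whole batch.
            yield to_record(raw)
            continue
        record, bad = from_json_with_error(raw)
        if bad is not None:
            logger.warning("Skipping invalid JSON record: %s", bad)
            continue
        yield record
```

With `skip_invalid=True`, a batch such as `['{"id": 1}', 'not json at all', '{"id": 2}']` yields the two valid records and logs one warning; with the default, iterating the same batch raises on the second element.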
   
   - Added `SKIP_INVALID_JSON_RECORDS` config property to `HoodieStreamerConfig`
   - Modified `SourceFormatAdapter.transformJsonToGenericRdd()` to skip bad 
records during Avro conversion when config is enabled
   - Modified `SourceFormatAdapter.transformJsonToRowRdd()` to skip bad records 
during Row conversion when config is enabled
   - Modified `SourceFormatAdapter.fetchNewDataInRowFormat()` to use Spark's 
PERMISSIVE mode to skip corrupt records in the Spark JSON read path when config 
is enabled
   - Added three tests: skipping in Avro format, skipping in Row format, and 
preserving the default crash behavior when the config is not enabled
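
For the Spark JSON read path, the idea behind PERMISSIVE mode can be sketched in plain Python. This emulates the documented semantics (a malformed line produces a row whose data fields are null and whose corrupt-record column holds the raw text) and then filters those rows out. It does not call Spark; `read_permissive` and `drop_corrupt` are hypothetical helpers, and how this PR actually filters corrupt rows is an assumption here.

```python
import json
from typing import Dict, List, Optional

# Spark's default name for the column that captures malformed input.
CORRUPT_COLUMN = "_corrupt_record"

Row = Dict[str, Optional[object]]


def read_permissive(lines: List[str], fields: List[str]) -> List[Row]:
    """Emulate PERMISSIVE parsing: never fail, mark bad lines instead."""
    rows: List[Row] = []
    for line in lines:
        row: Row = {name: None for name in fields}
        row[CORRUPT_COLUMN] = None
        try:
            parsed = json.loads(line)
            if not isinstance(parsed, dict):
                raise ValueError("not a JSON object")
            for name in fields:
                row[name] = parsed.get(name)  # missing fields stay None
        except ValueError:
            # Malformed input: keep the raw text, leave data fields null.
            row[CORRUPT_COLUMN] = line
        rows.append(row)
    return rows


def drop_corrupt(rows: List[Row]) -> List[Row]:
    """Keep only well-formed rows and remove the helper column."""
    return [
        {k: v for k, v in row.items() if k != CORRUPT_COLUMN}
        for row in rows
        if row[CORRUPT_COLUMN] is None
    ]
```

Calling `drop_corrupt(read_permissive(lines, ["id", "name"]))` returns only the rows parsed from valid JSON objects, which mirrors the "skip corrupt records" outcome described for this code path.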
   
   ### Impact
   
   New user-facing config: `hoodie.streamer.source.json.skip.invalid.records`
   - Default: `false` (no behavior change for existing users)
   - When `true`: invalid JSON records are skipped with WARN-level logging 
instead of crashing the pipeline
   
   No breaking changes. No public API changes. No performance impact for users 
who do not enable the config.
   
   ### Risk Level
   
   Low. The feature is opt-in and the default behavior is unchanged.
   
   ### Documentation Update
   
   None. A follow-up PR will update the documentation if this PR is merged.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
