arunkumarucet opened a new pull request, #17593: URL: https://github.com/apache/pinot/pull/17593
### Summary

This PR optimizes `ProtoBufRecordExtractor` performance by introducing field descriptor caching and reducing object allocation during protobuf message extraction.

### Problem

During high-throughput protobuf ingestion, the `ProtoBufRecordExtractor.extract()` method was repeatedly calling `descriptor.findFieldByName()` for every field on every message, causing unnecessary CPU overhead. Additionally, a new `ProtoBufFieldInfo` object was being allocated for each field extraction.

### Solution

1. **Field Descriptor Caching**: Cache field descriptors on first message extraction, eliminating repeated `findFieldByName()` lookups
2. **Reusable ProtoBufFieldInfo**: Reuse a single `ProtoBufFieldInfo` instance for top-level field extraction to reduce GC pressure
3. **Schema Change Detection**: Detect descriptor changes via `descriptor.getFullName()` comparison to handle schema evolution safely
4. **Bug Fix**: Fixed `_extractAll` not being reset to `false` on re-initialization with subset fields

### Benchmark Results

Speedup is the throughput of `extractOnly` (with caching) divided by the throughput of `extractWithoutCaching` for the same mode.

| Mode | extractOnly, with caching (ops/s) | extractWithoutCaching (ops/s) | Speedup |
|------|----------------------------------:|------------------------------:|--------:|
| all_fields | 1,157,051 | 969,656 | **1.19x** |
| subset_5_fields | 5,412,669 | 1,538,076 | **3.52x** |
| single_field | 19,467,735 | 4,788,686 | **4.07x** |

**Key Takeaway**: Field descriptor caching provides a **3.5x-4x speedup** for subset field extraction, which is the common case in production where tables only need specific fields from protobuf messages.
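For readers who want the caching idea in isolation, here is a minimal sketch of items 1 and 3 from the Solution list. It is not the code in this PR. `SketchDescriptor` and `CachingExtractorSketch` are hypothetical stand-ins written in plain Java so the example runs without the protobuf library. Only the method names `findFieldByName()` and `getFullName()` mirror the calls named above. Skipping fields that are absent from the descriptor is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a protobuf message descriptor (NOT the com.google.protobuf API).
final class SketchDescriptor {
  private final String fullName;
  private final List<String> fieldNames;
  private int lookups = 0; // counts findFieldByName() calls so the effect of caching is observable

  SketchDescriptor(String fullName, List<String> fieldNames) {
    this.fullName = fullName;
    this.fieldNames = new ArrayList<>(fieldNames);
  }

  String getFullName() {
    return fullName;
  }

  // Returns the name itself as a placeholder "field descriptor", or null if the field is absent.
  String findFieldByName(String name) {
    lookups++;
    return fieldNames.contains(name) ? name : null;
  }

  int lookupCount() {
    return lookups;
  }
}

// Hypothetical extractor: resolves fields once per descriptor and reuses the result afterwards.
final class CachingExtractorSketch {
  private final List<String> fieldsToRead;
  private final List<String> cachedFields = new ArrayList<>();
  private String cachedDescriptorName; // full name of the descriptor the cache was built for

  CachingExtractorSketch(List<String> fieldsToRead) {
    this.fieldsToRead = new ArrayList<>(fieldsToRead);
  }

  Map<String, Object> extract(SketchDescriptor descriptor, Map<String, Object> message) {
    // Schema change detection: rebuild the cache when the descriptor's full name differs.
    if (!descriptor.getFullName().equals(cachedDescriptorName)) {
      rebuildCache(descriptor);
    }
    Map<String, Object> row = new LinkedHashMap<>();
    for (String field : cachedFields) {
      row.put(field, message.get(field)); // no findFieldByName() call on this path
    }
    return row;
  }

  private void rebuildCache(SketchDescriptor descriptor) {
    cachedFields.clear();
    for (String name : fieldsToRead) {
      String field = descriptor.findFieldByName(name);
      if (field != null) { // assumption of this sketch: absent fields are skipped
        cachedFields.add(field);
      }
    }
    cachedDescriptorName = descriptor.getFullName();
  }
}
```

In this sketch, the first `extract()` call for a descriptor performs one lookup per requested field. Later calls with the same descriptor perform none, and a descriptor with a different full name triggers a single rebuild.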
### Changes

**Core Optimization:**
- `pinot-plugins/pinot-input-format/pinot-protobuf/src/main/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractor.java`
- `pinot-plugins/pinot-input-format/pinot-protobuf/src/main/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufFieldInfo.java`

**Tests:**
- `pinot-plugins/pinot-input-format/pinot-protobuf/src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractorCachingTest.java` (NEW - 13 tests)

**Benchmark:**
- `pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkProtoBufRecordExtractor.java` (NEW)

### Testing

- All 208 existing protobuf tests pass
- Added 13 new functional tests for caching behavior, schema change detection, and edge cases
- JMH benchmark added to pinot-perf module
