[PR] Optimize ProtoBufRecordExtractor with field descriptor caching [pinot]

via GitHub Wed, 28 Jan 2026 23:20:25 -0800


arunkumarucet opened a new pull request, #17593:
URL: https://github.com/apache/pinot/pull/17593


   ### Summary
   
   This PR optimizes `ProtoBufRecordExtractor` performance by introducing field 
descriptor caching and reducing object allocation during protobuf message 
extraction.
   
   ### Problem
   
   During high-throughput protobuf ingestion, the 
`ProtoBufRecordExtractor.extract()` method was repeatedly calling 
`descriptor.findFieldByName()` for every field on every message, causing 
unnecessary CPU overhead. Additionally, a new `ProtoBufFieldInfo` object was 
being allocated for each field extraction.
   
   ### Solution
   
   1. **Field Descriptor Caching**: Cache field descriptors on first message 
extraction, eliminating repeated `findFieldByName()` lookups
   2. **Reusable ProtoBufFieldInfo**: Reuse a single `ProtoBufFieldInfo` 
instance for top-level field extraction to reduce GC pressure
   3. **Schema Change Detection**: Detect descriptor changes via 
`descriptor.getFullName()` comparison to handle schema evolution safely
   4. **Bug Fix**: Fixed `_extractAll` not being reset to `false` on 
re-initialization with subset fields
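
The caching, schema-change check, and `_extractAll` reset described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `SketchDescriptor`, `SketchMessage`, and `CachingExtractorSketch` are hypothetical stand-ins that only mimic the `getFullName()` / `findFieldByName()` shape of a protobuf descriptor so the example runs without the protobuf library. It does not model `ProtoBufFieldInfo` reuse.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a protobuf descriptor (not the real protobuf class).
final class SketchDescriptor {
  private final String fullName;
  private final List<String> fieldNames;
  private final Map<String, Integer> indexByName = new HashMap<>();
  int lookupCount = 0; // counts by-name lookups so the effect of caching is observable

  SketchDescriptor(String fullName, List<String> fieldNames) {
    this.fullName = fullName;
    this.fieldNames = fieldNames;
    for (int i = 0; i < fieldNames.size(); i++) {
      indexByName.put(fieldNames.get(i), i);
    }
  }

  String getFullName() { return fullName; }

  List<String> getFieldNames() { return fieldNames; }

  // Returns the field's position, or null if the message type has no such field.
  Integer findFieldByName(String name) {
    lookupCount++;
    return indexByName.get(name);
  }
}

// Hypothetical stand-in for a decoded message: its descriptor plus positional values.
final class SketchMessage {
  final SketchDescriptor descriptor;
  final Object[] values;

  SketchMessage(SketchDescriptor descriptor, Object... values) {
    this.descriptor = descriptor;
    this.values = values;
  }
}

final class CachingExtractorSketch {
  private boolean extractAll = true;
  private List<String> requestedFields = new ArrayList<>();
  private String cachedDescriptorName; // full name of the descriptor the cache was built for
  private List<String> cachedNames;    // field names, in output order
  private Integer[] cachedIndexes;     // resolved once per descriptor, reused per message

  // Null or empty means "extract every field"; otherwise only the named subset.
  void init(List<String> fields) {
    // Always reassigned, so re-initializing with a subset clears a previous "extract all".
    extractAll = fields == null || fields.isEmpty();
    requestedFields = extractAll ? new ArrayList<>() : new ArrayList<>(fields);
    cachedDescriptorName = null; // force a cache rebuild on the next message
  }

  Map<String, Object> extract(SketchMessage message) {
    SketchDescriptor descriptor = message.descriptor;
    // Rebuild on the first message, after init, or when the descriptor's full name changes.
    if (!descriptor.getFullName().equals(cachedDescriptorName)) {
      rebuildCache(descriptor);
    }
    Map<String, Object> row = new LinkedHashMap<>();
    for (int i = 0; i < cachedIndexes.length; i++) {
      Integer index = cachedIndexes[i];
      row.put(cachedNames.get(i), index == null ? null : message.values[index]);
    }
    return row;
  }

  private void rebuildCache(SketchDescriptor descriptor) {
    cachedNames = extractAll ? descriptor.getFieldNames() : requestedFields;
    cachedIndexes = new Integer[cachedNames.size()];
    for (int i = 0; i < cachedIndexes.length; i++) {
      cachedIndexes[i] = descriptor.findFieldByName(cachedNames.get(i));
    }
    cachedDescriptorName = descriptor.getFullName();
  }
}
```

In this sketch, by-name lookups happen only inside `rebuildCache`, so their cost is paid once per descriptor rather than once per field per message.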
   
   ### Benchmark Results
   
   | Mode | With caching, `extractOnly` (ops/s) | Without caching, `extractWithoutCaching` (ops/s) | Speedup |
   |------|------------------------------------:|-------------------------------------------------:|--------:|
   | all_fields | 1,157,051 | 969,656 | **1.19x** |
   | subset_5_fields | 5,412,669 | 1,538,076 | **3.52x** |
   | single_field | 19,467,735 | 4,788,686 | **4.07x** |
   
   **Key Takeaway**: Field descriptor caching provides a **3.5x-4x speedup** for 
subset field extraction, compared with 1.19x when extracting all fields. Subset 
extraction is the common case in production, where tables only need specific 
fields from protobuf messages.
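
The speedup figures are the ratio of throughput with caching to throughput without it. A small sketch of that arithmetic, using the numbers from the benchmark table (`SpeedupCheck` is a hypothetical helper, not part of the PR):

```java
import java.util.Locale;

final class SpeedupCheck {
  // Speedup = throughput with caching / throughput without caching, to two decimals.
  static String speedup(long cachedOpsPerSec, long uncachedOpsPerSec) {
    return String.format(Locale.ROOT, "%.2fx", (double) cachedOpsPerSec / uncachedOpsPerSec);
  }
}
```

For example, `SpeedupCheck.speedup(5_412_669L, 1_538_076L)` gives the subset_5_fields ratio.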
   
   ### Changes
   
   **Core Optimization:**
   - `pinot-plugins/pinot-input-format/pinot-protobuf/src/main/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractor.java`
   - `pinot-plugins/pinot-input-format/pinot-protobuf/src/main/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufFieldInfo.java`
   
   **Tests:**
   - `pinot-plugins/pinot-input-format/pinot-protobuf/src/test/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufRecordExtractorCachingTest.java` (NEW - 13 tests)
   
   **Benchmark:**
   - `pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkProtoBufRecordExtractor.java` (NEW)
   
   ### Testing
   
   - All 208 existing protobuf tests pass
   - Added 13 new functional tests for caching behavior, schema change 
detection, and edge cases
   - JMH benchmark added to pinot-perf module


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

