Jackie-Jiang opened a new pull request, #18434: URL: https://github.com/apache/pinot/pull/18434
## Summary

Introduce `ArrowRecordExtractor` (extends `BaseRecordExtractor`) with schema-driven dispatch by `ArrowTypeID`; drop the bespoke `ArrowToGenericRowConverter`. The reader and decoder bind reader-scoped state once via `setReader(ArrowReader)`, which caches the dictionary map and pre-resolves the include list against the `VectorSchemaRoot`'s field vectors.

Add `ArrowRecordExtractorConfig` with `extractRawTimeValues`, which matches the Avro / Parquet flag: `Date` / `Time` / `Timestamp` surface as raw `int` / `long` in the schema's unit instead of the contract Java type.

`ArrowMessageDecoder.decode` now branches on row count:

- `0` → `null`
- `1` → fields populated directly into the destination
- `>1` → wrapped under `GenericRow.MULTIPLE_RECORDS_KEY`

### Bug fixes vs the prior converter

- `DateDayVector` returns `Integer` (not `LocalDateTime`); the old code cast unconditionally to `LocalDateTime` and would throw at runtime for `DateDay` columns.
- `UInt2Vector` returns `Character` (not a `Number`); the old code passed it through unchanged, violating the `Int(16) → Integer` contract.
- `UInt1Vector` was sign-extended (`200 → -56`) instead of zero-extended.
- All three are now schema-aware (dispatch on `ArrowType.Int.getIsSigned()` / `ArrowType.Date.getUnit()`).

### Tests

- New `ArrowRecordExtractorTest` covering every Arrow vector type, raw and contract modes, complex types (`List`, `Struct`, `Map`), dictionary encoding, and include-list filtering. Each test runs through a real `ArrowStreamWriter` → `ArrowStreamReader` IPC roundtrip so `setReader` is exercised against an actual `ArrowReader` (no mocks).
- `ArrowMessageDecoderTest` slimmed to decoder-specific concerns (lifecycle, error handling, empty / single / multi-row batch shapes).
- `ArrowRecordReaderTest` keeps the inherited `AbstractRecordReaderTest` round-trip; redundant filter test removed.
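
The `UInt1` / `UInt2` fixes described above come down to how Java widens narrow integer types. The following is a minimal sketch in plain Java, not the code from this PR: the class and method names are hypothetical, no Arrow API is used, and it assumes the raw values arrive as a `byte` and a `Character` respectively.

```java
// Hypothetical illustration only; not part of ArrowRecordExtractor.
final class UnsignedWideningSketch {

    // Implicit byte -> int widening sign-extends, so an unsigned 200
    // stored as the bit pattern 0xC8 comes out as -56.
    static int signExtend(byte raw) {
        return raw;
    }

    // Masking with 0xFF zero-extends instead, preserving the unsigned value.
    // Byte.toUnsignedInt(raw) is an equivalent JDK call.
    static int zeroExtendUInt1(byte raw) {
        return raw & 0xFF;
    }

    // A Character is not a Number, so it has to be converted explicitly.
    // char is an unsigned 16-bit type, so widening to int never goes negative.
    static Integer widenUInt2(Character raw) {
        return (int) raw.charValue();
    }
}
```

With these helpers, `signExtend((byte) 200)` yields `-56`, `zeroExtendUInt1((byte) 200)` yields `200`, and `widenUInt2('\uFFFF')` yields `65535`.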
