[PR] [Arrow] Column-major segment build for Arrow IPC sources [pinot]

via GitHub Fri, 29 May 2026 18:59:22 -0700


real-mj-song opened a new pull request, #18638:
URL: https://github.com/apache/pinot/pull/18638


   **TL;DR:** Adds `ArrowColumnReaderFactory` (caller-managed reader/allocator) 
and `ArrowFileColumnReaderFactory` (file convenience) to the `pinot-arrow` 
plugin, exposing Arrow IPC sources to the column-major build path introduced in 
#16727. Stacked on top of #18632.
   
   Tracks #18629.
   
   ### Depends on
   
   **#18632** (precursor PR) — extracts `ArrowToPinotTypeConverter` from 
`ArrowRecordExtractor` into a public shared utility. Must merge before this PR, 
since the new `ArrowColumnReader.getValue(int)` delegates to that utility.
   
   The diff currently shows both commits (precursor + this work); once #18632 
lands on `master`, GitHub will reduce the diff to just this PR's commit.
   
   ### What changed
   
   - **`ArrowColumnReaderFactory`** — implements `ColumnReaderFactory` over a 
caller-managed `ArrowReader` and `BufferAllocator`. The factory does not close 
them. Accepts any `ArrowReader` subclass — `ArrowFileReader`, 
`ArrowStreamReader`, or a custom subclass that yields in-process record 
batches. Suitable for batch pipelines that produce Arrow batches in-process 
(e.g. a Spark task using `mapInArrow`) and want to drive `buildColumnar()` 
without disk I/O.
   
   - **`ArrowFileColumnReaderFactory`** — file-specialised convenience that 
opens a private `RootAllocator` (sized by the `arrowAllocatorLimit` config), 
opens the IPC file via `ArrowFileReader`, and closes all three on `close()`. 
The natural choice for callers that just want to read an Arrow file on disk.
   
   - **`ArrowColumnReader`** — wraps a single Arrow `FieldVector` and 
implements the three documented `ColumnReader` access patterns: generic 
sequential `next()` with null checks, typed sequential (`nextInt()` / 
`nextLong()` / …) with `isNextNull()` / `skipNext()`, and random access by 
`docId`. Single-value and multi-value (`List` of primitive) variants supported.
   
   - **`ArrowAccumulators`** — package-private helper shared by both factories. 
Walks every record batch and bulk-appends each wanted column into a per-column 
`FieldVector` accumulator via Arrow's `VectorAppender` (Visitor-based; grows 
buffers once per batch and bulk-copies, instead of per-row `copyValueSafe`). 
Rejects dictionary-encoded columns loudly via `Preconditions.checkArgument` 
since the column-major path doesn't yet implement `DictionaryEncoder.decode`. 
Also exposes a `closeAll(map)` utility shared by both factory `close()` paths.
   
   `ArrowColumnReader.getValue(int docId)` delegates Arrow → Pinot type 
conversion to the shared `ArrowToPinotTypeConverter` (from #18632), returning 
canonical JDK types — `String` from `Utf8` / `LargeUtf8` (unwrapped from 
Arrow's `Text`), `Object[]` for `List` variants (with recursive element 
conversion), `LocalDate` / `LocalTime` / `Timestamp` for temporal types, 
`BigDecimal` for `Decimal`. Matches the row-major path's contract so per-column 
stats collectors, dictionary creators, and index creators downstream of 
`buildColumnar()` see the same JDK types they see today.
   
   ### Compatibility
   
   Purely additive on the SPI side:
   
   - The existing row-major `ArrowRecordReader` path is unchanged.
   - Users opt in to the column-major path by constructing the driver via the 
`ColumnReaderFactory`-accepting overload of 
`SegmentIndexCreationDriverImpl.init`.
   - No SPI changes — consumes the already-merged `ColumnReader` / 
`ColumnReaderFactory` interfaces from #16727.
   
   ### Testing
   
   | Suite | Tests |
   |---|---|
   | `ArrowColumnReaderFactoryTest` (caller-managed path via 
`ArrowStreamReader`) | 3 / 3 |
   | `ArrowFileColumnReaderFactoryTest` (file path) | 7 / 7 |
   | `ArrowColumnReaderStringTypeTest` (Text → String unwrap regression) | 1 / 
1 |
   | `ArrowColumnarBuildIntegrationTest` — row-major vs file-columnar segment 
metadata equivalence | 1 / 1 |
   | Row-major regression (`ArrowRecordExtractorTest` 54 + 
`ArrowMessageDecoderTest` 7 + `ArrowRecordReaderTest` 2) | 63 / 63 |
   
   The integration test builds a Pinot segment from the same Arrow IPC file via 
both paths (row-major via `ArrowRecordReader → driver.build()`; column-major 
via `ArrowFileColumnReaderFactory → driver.build() → buildColumnar()`) and 
asserts per-column cardinality, min, max, totalDocs, and data type match, plus 
segment-level doc count match.
   
   ### Arrow caveat: type assumptions downstream of `buildColumnar()`
   
   Several downstream cast sites in `pinot-segment-local` — for example 
`StringColumnPreIndexStatsCollector.collect` and 
`NoDictColumnStatisticsCollector.collect` — cast directly to `String` / 
`Integer` without type checks, and NPE on `null` entries. This holds for the 
existing segment-to-segment consumer of `buildColumnar()` but is fragile under 
any non-segment source. The shared converter satisfies the type contract today 
(e.g. unwraps Arrow's `Text` to `String`); the underlying assumptions are out 
of scope for this PR and warrant a follow-up to tighten them once a non-segment 
consumer exists in tree.
   
   ### References
   
   - Tracking issue: #18629
   - Precursor PR (must merge first): #18632
   - `ColumnReader` SPI: #16727
   - Row-major `ArrowRecordExtractor` refactor: #18434


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [Arrow] Column-major segment build for Arrow IPC sources [pinot]

Reply via email to