Re: [PR] Simplify ColumnReader SPI: drop iterator API, replace type checks with getValueType() [pinot]

via GitHub Fri, 03 Jul 2026 13:35:49 -0700


Copilot commented on code in PR #18918:
URL: https://github.com/apache/pinot/pull/18918#discussion_r3521769663



##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/columntransformer/DataTypeColumnTransformerTest.java:
##########
@@ -211,7 +153,7 @@ public void testIsNoOpForMatchingIntMVTypes() {
         .build();
     FieldSpec fieldSpec = schema.getFieldSpecFor(COLUMN_NAME);
     TableConfig tableConfig = new 
TableConfigBuilder(TableType.OFFLINE).setTableName("testTable").build();
-    ColumnReader reader = mockColumnReader().multiValue().asInt().build();
+    ColumnReader reader = mockReader(PinotDataType.INT_ARRAY);
     DataTypeColumnTransformer transformer = new 
DataTypeColumnTransformer(tableConfig, fieldSpec, reader);

Review Comment:
   The MV isNoOp coverage only asserts INT/LONG/STRING/BYTES; it is missing 
explicit `FLOAT_ARRAY` and `DOUBLE_ARRAY` cases, even though the SPI-driven 
`getValueType() == destType` logic needs to work for those types too. Adding MV 
float/double isNoOp tests would help ensure coverage across all supported 
primitive MV types.



##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/ColumnReader.java:
##########
@@ -23,300 +23,110 @@
 import java.io.Serializable;
 import java.math.BigDecimal;
 import javax.annotation.Nullable;
-
-
-/**
- * The <code>ColumnReader</code> interface is used to read column data from 
various data sources
- * for columnar segment building. Unlike RecordReader which reads row-by-row, 
ColumnReader provides
- * column-wise access to data, enabling efficient columnar segment creation.
- *
- * <p>This interface provides 3 patterns optimised for different use cases
- * (Some implementations may not support all patterns):
- * <ul>
- *   <li>Sequential iteration over all values in a column using hasNext(), 
next() and rewind()</li>
- *   <li>Sequential iteration with type-specific methods (isInt(), isLong(), 
nextInt(), nextLong(), etc.)
- *        and null handling (isNextNull(), skipNext()) and supports hasNext() 
and rewind()</li>
- *   <li>Random access by document ID using getInt(docId), getLong(docId), 
etc. and isNull(docId) for null checks</li>
- * </ul>
- *
- * <p>Implementations should handle data type conversions and efficient 
column-wise data access patterns.
- *
- * <h2>Usage Patterns</h2>
- *
- * <p>There are three primary patterns for reading data from a ColumnReader:
- *
- * <h3>Pattern 1: Sequential Iteration with Generic next() and Null Checks</h3>
- * <p>This pattern uses the generic {@link #next()} method which returns 
Object and may return null.
- * Suitable when you need to handle arbitrary data types or when null handling 
is done on the return value.
- *
- * <pre>{@code
- * // Read all values in the column
- * while (columnReader.hasNext()) {
- *   Object value;
- *   try {
- *    value = columnReader.next();
- *   } catch (Exception e) {
- *    // Handle exception / log
- *    continue;
- *    }
- *   if (value != null) {
- *     // Process non-null value
- *     processValue(value);
- *   } else {
- *     // Handle null value
- *     handleNullValue();
- *   }
- * }
- *
- * // Rewind to read the column again
- * columnReader.rewind();
- *
- * // Second pass through the data
- * while (columnReader.hasNext()) {
- *   Object value = columnReader.next();
- *   if (value != null) {
- *     processValueAgain(value);
- *   }
- * }
- * }</pre>
- *
- * <h3>Pattern 2: Sequential Iteration with Type-Specific Methods and Explicit 
Null Checks</h3>
- * <p>This pattern uses {@link #isNextNull()} to check for nulls before 
calling type-specific methods
- * like {@link #nextInt()}, {@link #nextLong()}, etc. Use {@link #skipNext()} 
to advance past null values.
- * This is the preferred pattern when you know the column data type and want 
to avoid boxing overhead.
- * Before using this pattern, check the column data type using methods like 
{@link #isInt()}, {@link #isLong()}, etc.
- * If the data type does not match, fall back to Pattern 1 with {@link 
#next()}.
- * <pre>{@code
- * // Read all int values in the column, handling nulls
- * if (columnReader.isInt()) {
- *  while (columnReader.hasNext()) {
- *     if (columnReader.isNextNull()) {
- *       // Skip the null value
- *      columnReader.skipNext();
- *      handleNullValue();
- *     } else {
- *      // Read the primitive int value (no boxing)
- *      int value = columnReader.nextInt();
- *      processIntValue(value);
- *    }
- *  }
- * } else {
- *  // Fallback to Pattern 1 if not INT type
- * }
- *
- * }</pre>
- *
- * <h3>Pattern 3: Random Access by Document ID</h3>
- * <p>This pattern uses {@link #getTotalDocs()} to get the total number of 
documents, then uses
- * document ID-based accessors like {@link #getInt(int)}, {@link 
#getLong(int)}, etc. to read
- * specific values. Use {@link #isNull(int)} to check if a value is null 
before reading.
- * This pattern is useful when you need random access or want to process 
documents in a specific order.
- *
- * <pre>{@code
- *
- * // Random access example - read specific document IDs
- * int[] docIdsToRead = {5, 10, 15, 20};
- * for (int docId : docIdsToRead) {
- *   if (!columnReader.isNull(docId)) {
- *     int value = columnReader.getInt(docId);
- *     processSpecificDoc(docId, value);
- *   }
- * }
- *
- * // Read in reverse order
- * // Get the total number of documents
- * int totalDocs = columnReader.getTotalDocs();
- * for (int docId = totalDocs - 1; docId >= 0; docId--) {
- *   if (!columnReader.isNull(docId)) {
- *     int value = columnReader.getInt(docId);
- *     processReverseOrder(docId, value);
- *   }
- * }
- * }</pre>
- *
- * <h3>Choosing the Right Pattern</h3>
- * <ul>
- *   <li><b>Pattern 1</b>: Use when dealing with generic Object types or when 
you don't know
- *       the column type at compile time. Less efficient due to boxing.</li>
- *   <li><b>Pattern 2</b>: Use when you know the column type and want 
efficient sequential iteration
- *       with primitive types. Preferred for most columnar segment building 
scenarios.</li>
- *   <li><b>Pattern 3</b>: Use when you need random access, want to process 
documents in a specific
- *       order, or need to access the same document multiple times.</li>
- * </ul>
- *
- */
+import org.apache.pinot.spi.data.FieldSpec.DataType;
+import org.apache.pinot.spi.utils.PinotDataType;
+
+
+/// Reads the values of a single column by document ID, for columnar segment 
building. Where a `RecordReader` exposes a
+/// data source row-by-row, a `ColumnReader` exposes one column at a time: a 
caller reads every value of a column before
+/// moving to the next one. Building a segment one fully-materialized column 
at a time is cheaper than transposing row
+/// records when the source is already columnar (e.g. Arrow, Parquet).
+///
+/// Documents are addressed by a 0-based id, from 0 (inclusive) to 
[#getTotalDocs()] (exclusive). To read a column:
+/// 1. Call [#getValueType()] once. A non-`null` result names the type that 
can be read directly through the matching
+///    type-specific accessor; `null` means the column must be read through 
the boxed [#getValue(int)].
+/// 2. For each document id, call [#isNull(int)] first — type-specific 
accessors cannot represent null and must not be
+///    called for a null value.
+/// 3. Read each non-null value with the accessor chosen in step 1, e.g. 
[#getInt(int)] / [#getIntMV(int)] when
+///    [#getValueType()] is [PinotDataType#INT] / [PinotDataType#INT_ARRAY], 
or [#getValue(int)] otherwise.
+///
+/// Implementations perform any conversion between the source encoding and the 
returned Pinot type, are not required to
+/// be thread-safe, and must be released with [#close()] when no longer needed.
+///
+/// ```
+/// PinotDataType valueType = columnReader.getValueType();
+/// for (int docId = 0; docId < columnReader.getTotalDocs(); docId++) {
+///   if (columnReader.isNull(docId)) {
+///     continue;
+///   }
+///   if (valueType == PinotDataType.INT) {
+///     consumeInt(columnReader.getInt(docId));
+///   } else {
+///     consumeObject(columnReader.getValue(docId));
+///   }
+/// }
+/// ```
 public interface ColumnReader extends Closeable, Serializable {
 
-  /**
-   * Return <code>true</code> if more values remain to be read in this column.
-   * <p>This method should not throw exception. Caller is not responsible for 
handling exceptions from this method.
-   */
-  boolean hasNext();
+  /// Returns the name of the column read by this reader.
+  String getColumnName();
 
-  /**
-   * Get the next value in the column.
-   * <p>This method should be called only if {@link #hasNext()} returns 
<code>true</code>. Caller is responsible for
-   * handling exceptions from this method and skip the value if user wants to 
continue reading the remaining values.
-   *
-   * @return Next column value, or null if the value is null
-   * @throws IOException If an I/O error occurs while reading
-   */
+  /// Returns the [PinotDataType] whose type-specific accessors can read this 
column directly, or `null` if the column
+  /// has no such accessor and must be read through [#getValue(int)].
+  ///
+  /// The returned type encodes cardinality: a single-value column returns a 
scalar type read through the scalar
+  /// accessor (e.g. [PinotDataType#INT] via [#getInt(int)]), while a 
multi-value column returns the array type read
+  /// through the multi-value accessor (e.g. [PinotDataType#INT_ARRAY] via 
[#getIntMV(int)]). Columns whose stored type
+  /// has no dedicated accessor — such as BOOLEAN, TIMESTAMP, or JSON — return 
`null`.

Review Comment:
   The `getValueType()` Javadoc says BOOLEAN columns have no dedicated accessor 
and should return `null`, but Arrow-based implementations treat Arrow booleans 
(`BitVector` / `ArrowType.Bool`) as directly readable INT (0/1) and return 
`PinotDataType.INT`. This is a contract/documentation mismatch that could 
confuse implementers and consumers; please clarify the SPI Javadoc to match the 
actual behavior (or adjust Arrow readers to return `null` for booleans if that 
is the intended contract).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Simplify ColumnReader SPI: drop iterator API, replace type checks with getValueType() [pinot]

Reply via email to