Copilot commented on code in PR #18436:
URL: https://github.com/apache/pinot/pull/18436#discussion_r3203263693
##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/BaseRecordExtractor.java:
##########
@@ -18,46 +18,16 @@
*/
package org.apache.pinot.spi.data.readers;
-import com.google.common.collect.Maps;
-import java.math.BigDecimal;
-import java.math.BigInteger;
-import java.nio.ByteBuffer;
import java.sql.Timestamp;
import java.util.Base64;
-import java.util.Collection;
-import java.util.Map;
import java.util.Set;
import javax.annotation.Nullable;
import org.apache.commons.collections4.CollectionUtils;
-import org.apache.commons.lang3.ArrayUtils;
-/// Default [RecordExtractor] implementation. Subclasses override only the
bits the format needs.
-///
-/// [#convert] dispatches by shape:
-/// - multi-value (`Collection` / non-`byte[]` array) → [#convertMultiValue] →
`Object[]` (each element recursed)
-/// - map → [#convertMap] → `Map<String, Object>` (keys converted via
[#convertSingleValue] then stringified via
-/// [#stringifyMapKey], each value recursed)
-/// - nested record → [#convertRecord] (throws by default; override for
formats with nested records)
-/// - everything else → [#convertSingleValue]
-///
-/// [#convertSingleValue] applies universal normalizations for primitive types
that any format extractor might
-/// produce:
-/// - `Byte` / `Short` widen to `Integer` so all small ints unify behind a
single Pinot type
-/// - `BigInteger` widens to `BigDecimal` (Pinot has no `BigInteger` data
type; downstream transforms handle
-/// `BigDecimal` natively)
-/// - other `Number` (`Integer` / `Long` / `Float` / `Double` / `BigDecimal`)
passes through
-/// - `Boolean` passes through
-/// - `byte[]` passes through
-/// - `ByteBuffer` materializes to `byte[]` (sliced so the source buffer's
position is not advanced)
-/// - everything else falls back to `value.toString()`
-///
-/// **Logical types (DECIMAL / TIMESTAMP / DATE / TIME / UUID) are NOT handled
here** — see [RecordExtractor]
-/// for the contract. Format-specific extractors do the native-to-contract
conversion themselves (e.g. the
-/// Avro extractor walks the schema in its own `extract` and never reaches
this dispatcher).
+/// Default [RecordExtractor] base providing include-list resolution via
[#init] and the [#stringifyMapKey] helper.
///
/// @param <T> the format of the input record
-@SuppressWarnings({"rawtypes", "unchecked"})
public abstract class BaseRecordExtractor<T> implements RecordExtractor<T> {
Review Comment:
Removing the protected conversion dispatcher/helpers from
BaseRecordExtractor is a source/binary incompatible change for any external
RecordExtractor implementations that extend this SPI class and relied on
convert()/convertMap()/convertSingleValue(), etc. Since this class lives in
pinot-spi (published for external consumers), consider keeping the old methods
as `@Deprecated` (possibly moved to a utility and delegated to) for at least
one release cycle, or otherwise providing a compatibility layer/migration path.
##########
pinot-plugins/pinot-input-format/pinot-parquet/src/main/java/org/apache/pinot/plugin/inputformat/parquet/ParquetNativeRecordExtractor.java:
##########
@@ -51,32 +50,22 @@
/// - `BINARY` with `STRING` / `ENUM` annotation → `String`
/// - `BINARY` / `FIXED_LEN_BYTE_ARRAY` without annotation → `byte[]`
///
-/// **Logical types** — Parquet logical-type annotations decoded into uniform
Java types regardless of the
-/// underlying physical encoding:
-/// - `DECIMAL` (on `INT32` / `INT64` / `BINARY` / `FIXED_LEN_BYTE_ARRAY`) →
`BigDecimal`. Always converted —
-/// raw bytes aren't interpretable without external precision / scale
-/// - `INT64` + `TIMESTAMP_MILLIS` / `MICROS` / `NANOS` → `Timestamp`
(sub-millisecond nanos preserved via
-/// `Timestamp#setNanos`), or `Long` raw value in the column's declared unit
when raw
-/// - `INT96` → `Timestamp` (sub-millisecond nanos preserved), or `Long` epoch
nanos when raw (nanos is
-/// INT96's natural unit since its physical encoding — nanos-of-day + Julian
day — carries nanosecond
-/// precision)
-/// - `INT32` + `DATE` → `LocalDate`, or `Integer` raw days-since-epoch when
raw
-/// - `INT32` + `TIME_MILLIS` → `LocalTime`, or `Integer` raw
ms-since-midnight when raw
-/// - `INT64` + `TIME_MICROS` / `TIME_NANOS` → `LocalTime` (full nanosecond
precision), or `Long` raw
-/// value-since-midnight in the column's declared unit when raw
-/// - `FIXED_LEN_BYTE_ARRAY(16)` + `UUID` → `java.util.UUID`. Always
converted; the downstream type
-/// transformer adapts to the Pinot column's storage type — `STRING` →
canonical form, `BYTES` →
-/// 16-byte big-endian form
+/// **Logical types:**
+/// - `DECIMAL` (on `INT32` / `INT64` / `BINARY` / `FIXED_LEN_BYTE_ARRAY`) →
`BigDecimal`
+/// - `INT64` + `TIMESTAMP_MILLIS` / `MICROS` / `NANOS` → `Timestamp`, or
`Long` in the column's declared unit
+/// when `extractRawTimeValues` is `true`
+/// - `INT96` → `Timestamp`, or `Long` epoch nanos when `extractRawTimeValues`
is `true`
+/// - `INT32` + `DATE` → `LocalDate`, or `Integer` days-since-epoch when
`extractRawTimeValues` is `true`
+/// - `INT32` + `TIME_MILLIS` → `LocalTime`, or `Integer` ms-since-midnight
when `extractRawTimeValues` is `true`
+/// - `INT64` + `TIME_MICROS` / `TIME_NANOS` → `LocalTime`, or `Long`
value-since-midnight in the column's
+/// declared unit when `extractRawTimeValues` is `true`
+/// - `FIXED_LEN_BYTE_ARRAY(16)` + `UUID` → `java.util.UUID`
///
-/// [ParquetNativeRecordExtractorConfig#isExtractRawTimeValues] opts out of
TIMESTAMP / DATE / TIME conversion.
-/// DECIMAL and UUID always convert.
-///
-/// **Multi-value:** `LIST`-annotated group (standard 3-level wrapper or
legacy non-wrapper forms) → `Object[]`.
-///
-/// **Map / nested complex:** `MAP`-annotated group → `Map<Object, Object>`;
plain non-annotated group →
-/// `Map<String, Object>` (struct).
-///
-/// **Null:** field with zero repetition count → `null`.
+/// **Complex types:**
+/// - `LIST`-annotated group (standard 3-level wrapper or legacy non-wrapper
forms) → `Object[]`
+/// - `MAP`-annotated group → `Map<Object, Object>`
Review Comment:
The class-level type matrix says MAP-annotated groups convert to
`Map<Object, Object>`, but the implementation builds a `Map<String, Object>`
and stringifies keys via stringifyMapKey(). Please update the documentation to
reflect the actual contract/behavior (Map<String, Object>).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]