Jackie-Jiang commented on code in PR #18436:
URL: https://github.com/apache/pinot/pull/18436#discussion_r3203341810
##########
pinot-plugins/pinot-input-format/pinot-parquet/src/main/java/org/apache/pinot/plugin/inputformat/parquet/ParquetNativeRecordExtractor.java:
##########
@@ -51,32 +50,22 @@
/// - `BINARY` with `STRING` / `ENUM` annotation → `String`
/// - `BINARY` / `FIXED_LEN_BYTE_ARRAY` without annotation → `byte[]`
///
-/// **Logical types** — Parquet logical-type annotations decoded into uniform
Java types regardless of the
-/// underlying physical encoding:
-/// - `DECIMAL` (on `INT32` / `INT64` / `BINARY` / `FIXED_LEN_BYTE_ARRAY`) →
`BigDecimal`. Always converted —
-/// raw bytes aren't interpretable without external precision / scale
-/// - `INT64` + `TIMESTAMP_MILLIS` / `MICROS` / `NANOS` → `Timestamp`
(sub-millisecond nanos preserved via
-/// `Timestamp#setNanos`), or `Long` raw value in the column's declared unit
when raw
-/// - `INT96` → `Timestamp` (sub-millisecond nanos preserved), or `Long` epoch
nanos when raw (nanos is
-/// INT96's natural unit since its physical encoding — nanos-of-day + Julian
day — carries nanosecond
-/// precision)
-/// - `INT32` + `DATE` → `LocalDate`, or `Integer` raw days-since-epoch when
raw
-/// - `INT32` + `TIME_MILLIS` → `LocalTime`, or `Integer` raw
ms-since-midnight when raw
-/// - `INT64` + `TIME_MICROS` / `TIME_NANOS` → `LocalTime` (full nanosecond
precision), or `Long` raw
-/// value-since-midnight in the column's declared unit when raw
-/// - `FIXED_LEN_BYTE_ARRAY(16)` + `UUID` → `java.util.UUID`. Always
converted; the downstream type
-/// transformer adapts to the Pinot column's storage type — `STRING` →
canonical form, `BYTES` →
-/// 16-byte big-endian form
+/// **Logical types:**
+/// - `DECIMAL` (on `INT32` / `INT64` / `BINARY` / `FIXED_LEN_BYTE_ARRAY`) →
`BigDecimal`
+/// - `INT64` + `TIMESTAMP_MILLIS` / `MICROS` / `NANOS` → `Timestamp`, or
`Long` in the column's declared unit
+/// when `extractRawTimeValues` is `true`
+/// - `INT96` → `Timestamp`, or `Long` epoch nanos when `extractRawTimeValues`
is `true`
+/// - `INT32` + `DATE` → `LocalDate`, or `Integer` days-since-epoch when
`extractRawTimeValues` is `true`
+/// - `INT32` + `TIME_MILLIS` → `LocalTime`, or `Integer` ms-since-midnight
when `extractRawTimeValues` is `true`
+/// - `INT64` + `TIME_MICROS` / `TIME_NANOS` → `LocalTime`, or `Long`
value-since-midnight in the column's
+/// declared unit when `extractRawTimeValues` is `true`
+/// - `FIXED_LEN_BYTE_ARRAY(16)` + `UUID` → `java.util.UUID`
///
-/// [ParquetNativeRecordExtractorConfig#isExtractRawTimeValues] opts out of
TIMESTAMP / DATE / TIME conversion.
-/// DECIMAL and UUID always convert.
-///
-/// **Multi-value:** `LIST`-annotated group (standard 3-level wrapper or
legacy non-wrapper forms) → `Object[]`.
-///
-/// **Map / nested complex:** `MAP`-annotated group → `Map<Object, Object>`;
plain non-annotated group →
-/// `Map<String, Object>` (struct).
-///
-/// **Null:** field with zero repetition count → `null`.
+/// **Complex types:**
+/// - `LIST`-annotated group (standard 3-level wrapper or legacy non-wrapper
forms) → `Object[]`
+/// - `MAP`-annotated group → `Map<Object, Object>`
Review Comment:
Doc fixed — `Map<Object, Object>` → `Map<String, Object>` with the
`stringifyMapKey` note. The implementation was already correct; the type matrix
entry was stale (a parallel bug to the `ThriftRecordExtractor` one fixed in an
earlier review pass — should have caught both at once).
##########
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/BaseRecordExtractor.java:
##########
@@ -18,46 +18,16 @@
*/
package org.apache.pinot.spi.data.readers;
-import com.google.common.collect.Maps;
-import java.math.BigDecimal;
-import java.math.BigInteger;
-import java.nio.ByteBuffer;
import java.sql.Timestamp;
import java.util.Base64;
-import java.util.Collection;
-import java.util.Map;
import java.util.Set;
import javax.annotation.Nullable;
import org.apache.commons.collections4.CollectionUtils;
-import org.apache.commons.lang3.ArrayUtils;
-/// Default [RecordExtractor] implementation. Subclasses override only the
bits the format needs.
-///
-/// [#convert] dispatches by shape:
-/// - multi-value (`Collection` / non-`byte[]` array) → [#convertMultiValue] →
`Object[]` (each element recursed)
-/// - map → [#convertMap] → `Map<String, Object>` (keys converted via
[#convertSingleValue] then stringified via
-/// [#stringifyMapKey], each value recursed)
-/// - nested record → [#convertRecord] (throws by default; override for
formats with nested records)
-/// - everything else → [#convertSingleValue]
-///
-/// [#convertSingleValue] applies universal normalizations for primitive types
that any format extractor might
-/// produce:
-/// - `Byte` / `Short` widen to `Integer` so all small ints unify behind a
single Pinot type
-/// - `BigInteger` widens to `BigDecimal` (Pinot has no `BigInteger` data
type; downstream transforms handle
-/// `BigDecimal` natively)
-/// - other `Number` (`Integer` / `Long` / `Float` / `Double` / `BigDecimal`)
passes through
-/// - `Boolean` passes through
-/// - `byte[]` passes through
-/// - `ByteBuffer` materializes to `byte[]` (sliced so the source buffer's
position is not advanced)
-/// - everything else falls back to `value.toString()`
-///
-/// **Logical types (DECIMAL / TIMESTAMP / DATE / TIME / UUID) are NOT handled
here** — see [RecordExtractor]
-/// for the contract. Format-specific extractors do the native-to-contract
conversion themselves (e.g. the
-/// Avro extractor walks the schema in its own `extract` and never reaches
this dispatcher).
+/// Default [RecordExtractor] base providing include-list resolution via
[#init] and the [#stringifyMapKey] helper.
///
/// @param <T> the format of the input record
-@SuppressWarnings({"rawtypes", "unchecked"})
public abstract class BaseRecordExtractor<T> implements RecordExtractor<T> {
Review Comment:
Same answer as the parallel thread from @xiangfu0 above — not maintaining
backward compat for the removed protected helpers. The OSS-shipped format
extractors are all self-contained now, the dispatcher was an internal-style
protected SPI rather than a public extension point, and downstream subclasses
that override or call the removed methods will need a one-time migration when
bumping their Pinot dependency.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]