zhli1142015 commented on code in PR #12092:
URL: https://github.com/apache/gluten/pull/12092#discussion_r3246844614
##########
cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.cc:
##########
@@ -95,4 +98,405 @@ std::shared_ptr<ColumnarBatch>
VeloxColumnarBatchSerializer::deserialize(uint8_t
return std::make_shared<VeloxColumnarBatch>(result);
}
+namespace {
+
+// Per-type FlatVector min/max scan + NaN guard. Returns false when the column
must be marked
+// unsupported (any NaN observed for floating-point types -- Spark equality
NaN != NaN means
+// min/max-based pruning would silently drop matching rows).
+//
+// Floating-point edge cases that DO NOT poison the column:
+// - +/-Infinity: ordered (-Inf < x < +Inf for finite x); participate in
min/max normally.
+// - +0 and -0: IEEE 754 declares them equal under <, ==; min/max bound is
correct either way.
+// - subnormal (denormal) values: ordered like normal floats; no special
handling needed.
+template <typename T>
+bool scanMinMax(const facebook::velox::FlatVector<T>* flat, T& tLo, T& tHi,
int64_t& nullCnt, bool& seen) {
+ const auto size = flat->size();
+ const uint64_t* nulls = flat->rawNulls();
+ const T* values = flat->rawValues();
+ for (vector_size_t i = 0; i < size; ++i) {
+ if (nulls != nullptr && bits::isBitNull(nulls, i)) {
+ ++nullCnt;
+ continue;
+ }
+ T v = values[i];
+ if constexpr (std::is_floating_point_v<T>) {
+ if (std::isnan(v)) {
+ return false;
+ }
+ }
+ if (!seen) {
+ tLo = v;
+ tHi = v;
+ seen = true;
+ } else {
+ if (v < tLo)
+ tLo = v;
+ if (v > tHi)
+ tHi = v;
+ }
+ }
+ return true;
+}
+
+} // namespace
+
+std::vector<ColumnStats>
VeloxColumnarBatchSerializer::computeStats(RowVectorPtr rowVector) {
+ std::vector<ColumnStats> result;
+ const auto numCols = rowVector->childrenSize();
+ result.resize(numCols);
+ for (column_index_t col = 0; col < numCols; ++col) {
+ auto& stats = result[col];
+ auto child = rowVector->childAt(col);
+ if (child == nullptr || !child->isFlatEncoding()) {
Review Comment:
Skipping non-flat children leaves `ColumnStats::nullCount` at the default 0,
but unsupported columns still serialize that null count with `emitSupported =
0`. The JVM side can still use `nullCount` for `IsNull` pruning even when
min/max bounds are unsupported, so a dictionary/constant/complex column
containing nulls may be advertised as having no nulls and be incorrectly pruned
by `IS NULL`. Since `framedSerializeWithStats` passes `getRowVector()` directly
into `computeStats`, these encodings can still reach this path. Could we
compute nullCount independently of flat/min-max support (or flatten/decode
before stats) and cover non-flat/unsupported-type null stats?
##########
cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.cc:
##########
@@ -95,4 +98,405 @@ std::shared_ptr<ColumnarBatch>
VeloxColumnarBatchSerializer::deserialize(uint8_t
return std::make_shared<VeloxColumnarBatch>(result);
}
+namespace {
+
+// Per-type FlatVector min/max scan + NaN guard. Returns false when the column
must be marked
+// unsupported (any NaN observed for floating-point types -- Spark equality
NaN != NaN means
+// min/max-based pruning would silently drop matching rows).
+//
+// Floating-point edge cases that DO NOT poison the column:
+// - +/-Infinity: ordered (-Inf < x < +Inf for finite x); participate in
min/max normally.
+// - +0 and -0: IEEE 754 declares them equal under <, ==; min/max bound is
correct either way.
+// - subnormal (denormal) values: ordered like normal floats; no special
handling needed.
+template <typename T>
+bool scanMinMax(const facebook::velox::FlatVector<T>* flat, T& tLo, T& tHi,
int64_t& nullCnt, bool& seen) {
+ const auto size = flat->size();
+ const uint64_t* nulls = flat->rawNulls();
+ const T* values = flat->rawValues();
Review Comment:
This helper is also instantiated by the BOOLEAN case below, but
`FlatVector<bool>::rawValues()` is unsupported in Velox and throws. With
partition stats enabled, materializing a cached `BooleanType` column can
therefore fail instead of just producing min/max stats. Could the bool path use
a bool-specific scan via `valueAt(i)`/`isNullAt(i)` (or the underlying bits)
and add a C++ or E2E regression test that caches a Boolean column with stats
enabled?
##########
cpp/velox/operators/serializer/VeloxColumnarBatchSerializer.cc:
##########
@@ -95,4 +98,405 @@ std::shared_ptr<ColumnarBatch>
VeloxColumnarBatchSerializer::deserialize(uint8_t
return std::make_shared<VeloxColumnarBatch>(result);
}
+namespace {
+
+// Per-type FlatVector min/max scan + NaN guard. Returns false when the column
must be marked
+// unsupported (any NaN observed for floating-point types -- Spark equality
NaN != NaN means
+// min/max-based pruning would silently drop matching rows).
+//
+// Floating-point edge cases that DO NOT poison the column:
+// - +/-Infinity: ordered (-Inf < x < +Inf for finite x); participate in
min/max normally.
+// - +0 and -0: IEEE 754 declares them equal under <, ==; min/max bound is
correct either way.
+// - subnormal (denormal) values: ordered like normal floats; no special
handling needed.
+template <typename T>
+bool scanMinMax(const facebook::velox::FlatVector<T>* flat, T& tLo, T& tHi,
int64_t& nullCnt, bool& seen) {
+ const auto size = flat->size();
+ const uint64_t* nulls = flat->rawNulls();
+ const T* values = flat->rawValues();
+ for (vector_size_t i = 0; i < size; ++i) {
+ if (nulls != nullptr && bits::isBitNull(nulls, i)) {
+ ++nullCnt;
+ continue;
+ }
+ T v = values[i];
+ if constexpr (std::is_floating_point_v<T>) {
+ if (std::isnan(v)) {
+ return false;
Review Comment:
Returning on the first NaN leaves `nullCnt` as only the nulls seen before
that NaN. Even when bounds are disabled, the framed stats still emit
`nullCount`, and Spark cache pruning uses `IsNull` as `statsFor(a).nullCount >
0`. A partition like `[NaN, null]` would be serialized with `nullCount = 0`, so
`col IS NULL` could prune away matching rows. Please keep scanning after
marking min/max unsupported (or compute nullCount separately) and add a
Float/Double NaN+null `IS NULL` regression test.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]