yaooqinn commented on code in PR #12107:
URL: https://github.com/apache/gluten/pull/12107#discussion_r3260064506


##########
cpp/velox/shuffle/VeloxHashShuffleWriter.cc:
##########
@@ -1315,6 +1316,47 @@ uint64_t 
VeloxHashShuffleWriter::valueBufferSizeForFixedWidthArray(uint32_t fixe
   return valueBufferSize;
 }
 
+void VeloxHashShuffleWriter::accumulateInputEncodingCounts(const 
ColumnarBatch& cb) {
+  // Only velox-typed batches expose per-child encoding; foreign batches
+  // (e.g. arrow round-trips coming from non-velox sources) will be flattened
+  // by `VeloxColumnarBatch::from` later and we'd undercount, so just skip
+  // them here rather than reporting a misleading "all flat" mix.
+  if (cb.getType() != "velox") {
+    return;
+  }
+  const auto* veloxBatch = dynamic_cast<const VeloxColumnarBatch*>(&cb);
+  if (veloxBatch == nullptr) {
+    return;
+  }
+  const auto& rowVector = veloxBatch->getRowVector();
+  if (rowVector == nullptr) {
+    return;
+  }
+  for (const auto& child : rowVector->children()) {
+    if (child == nullptr) {
+      ++inputEncodingCounts_[kInputEncodingOther];
+      continue;
+    }
+    switch (child->encoding()) {
+      case facebook::velox::VectorEncoding::Simple::FLAT:
+        ++inputEncodingCounts_[kInputEncodingFlat];
+        break;
+      case facebook::velox::VectorEncoding::Simple::DICTIONARY:
+        ++inputEncodingCounts_[kInputEncodingDictionary];
+        break;
+      case facebook::velox::VectorEncoding::Simple::CONSTANT:
+        ++inputEncodingCounts_[kInputEncodingConstant];
+        break;
+      case facebook::velox::VectorEncoding::Simple::LAZY:
+        ++inputEncodingCounts_[kInputEncodingLazy];
+        break;
+      default:
+        ++inputEncodingCounts_[kInputEncodingOther];

Review Comment:
   `facebook::velox::VectorEncoding::Simple` 
(velox/vector/VectorEncoding.h:29-41) has 11 values; the switch handles 4 and 
the default funnels the remaining 7 (BIASED, SEQUENCE, ROW, MAP, FLAT_MAP, 
ARRAY, FUNCTION) into "Other". ROW / MAP / ARRAY are first-class struct / map / 
array column types — the sibling `VeloxShuffleWriterTest.cc:488,492` exercises 
exactly those via `makeArrayVector` / `makeMapVector`, so the shuffle writer is 
expected to see them in normal Spark workloads. A reader interpreting 
"Other=80%" in the log will assume rare/unknown encodings, not "I have a struct 
column". Would you consider renaming `Other` to `OtherOrComplex` and/or 
splitting out a `kInputEncodingComplex` bucket for ROW / MAP / FLAT_MAP / ARRAY?



##########
cpp/velox/shuffle/VeloxHashShuffleWriter.cc:
##########
@@ -1328,6 +1370,22 @@ void VeloxHashShuffleWriter::stat() const {
     }
     LOG(INFO) << oss.str();
   }
+  {
+    std::ostringstream oss;
+    oss << "Velox shuffle writer stat:InputEncoding";
+    int64_t total = 0;
+    for (auto v : inputEncodingCounts_) {
+      total += v;
+    }
+    for (int b = 0; b < kInputEncodingNum; ++b) {
+      auto v = inputEncodingCounts_[b];
+      oss << " " << inputEncodingName(static_cast<InputEncodingBucket>(b)) << 
"=" << v;
+      if (total > 0) {
+        oss << "(" << (100.0 * static_cast<double>(v) / 
static_cast<double>(total)) << "%)";

Review Comment:
   Since `accumulateInputEncodingCounts` early-returns on non-velox batches 
(line 1322), `total` here covers only velox-typed children, while the sibling 
`cpuWallTimingList_` block above prints raw counts over all `write()` calls. 
Worth printing the skip count next to the percentages so the two log lines have 
a comparable denominator?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to