westonpace commented on a change in pull request #10729:
URL: https://github.com/apache/arrow/pull/10729#discussion_r671593544



##########
File path: cpp/src/parquet/statistics.cc
##########
@@ -567,6 +568,27 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     SetMinMaxPair(comparator_->GetMinMax(values));
   }
 
+  void UpdateArrowDictionary(const ::arrow::Array& indices,
+                             const ::arrow::Array& dictionary) {
+    IncrementNullCount(indices.null_count());
+    IncrementNumValues(indices.length() - indices.null_count());
+
+    if (indices.null_count() == indices.length()) {
+      return;
+    }
+
+    ::arrow::compute::ExecContext ctx(pool_);
+    PARQUET_ASSIGN_OR_THROW(auto referenced_indices,
+                            ::arrow::compute::Unique(indices, &ctx));
+    PARQUET_ASSIGN_OR_THROW(

Review comment:
       First, I'm not sure if it was clear, but there are actually two array 
allocations here.  First it creates a unique array which is a (hopefully small) 
subset of the indices.  Second, it creates a referenced values array which is a 
(probably not much smaller) subset of the values.  Sadly the MinMax functions 
don't yet support strings.  There are kernels (I think) to get a sort order but 
I can't see how that will help because they don't take selection vectors yet.  
We could take the output from `Unique` improve Comparator to take a selection 
vector.  This would save the second allocation.  However, I'm reluctant to 
undertake that work when I think the eventual solution will be for the MinMax 
kernels to support all types.  I've created PARQUET-2068 for follow-up.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to