emkornfield commented on a change in pull request #10729:
URL: https://github.com/apache/arrow/pull/10729#discussion_r670963422



##########
File path: cpp/src/parquet/statistics.cc
##########
@@ -567,6 +568,27 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
     SetMinMaxPair(comparator_->GetMinMax(values));
   }
 
+  void UpdateArrowDictionary(const ::arrow::Array& indices,
+                             const ::arrow::Array& dictionary) {
+    IncrementNullCount(indices.null_count());
+    IncrementNumValues(indices.length() - indices.null_count());
+
+    if (indices.null_count() == indices.length()) {
+      return;
+    }
+
+    ::arrow::compute::ExecContext ctx(pool_);
+    PARQUET_ASSIGN_OR_THROW(auto referenced_indices,
+                            ::arrow::compute::Unique(indices, &ctx));
+    PARQUET_ASSIGN_OR_THROW(

Review comment:
       Does this allocate a whole new array? Maybe we should file a follow-up
JIRA to make this more efficient.
   
   In particular, it seems like it could be more efficient to build a map from 
dictionary index to sort order (I think we already have a kernel for this), 
refreshing it before calculating the statistics only if the dictionary has new 
entries, and then iterate through the indices doing the comparison in index space.
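
   A rough sketch of that idea follows, using plain standard-library containers 
rather than Arrow arrays so it stays self-contained. The helper names 
`DictionaryRanks` and `MinMaxIndices` are hypothetical and not existing 
Arrow or Parquet APIs, and nulls are modeled with `std::optional` as an 
assumption. Each dictionary entry is ranked once, and the min/max scan over 
the indices then compares integer ranks only, without materializing a new 
array of values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: rank[i] is the position dictionary[i] would take if
// the dictionary were sorted. It only needs recomputing when the dictionary
// gains new entries.
std::vector<int32_t> DictionaryRanks(const std::vector<std::string>& dictionary) {
  std::vector<int32_t> order(dictionary.size());
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(), [&](int32_t a, int32_t b) {
    return dictionary[a] < dictionary[b];
  });
  std::vector<int32_t> rank(dictionary.size());
  for (int32_t pos = 0; pos < static_cast<int32_t>(order.size()); ++pos) {
    rank[order[pos]] = pos;
  }
  return rank;
}

// Hypothetical helper: returns the dictionary indices of the minimum and
// maximum referenced values, comparing ranks instead of the values
// themselves. Returns std::nullopt when every index is null.
std::optional<std::pair<int32_t, int32_t>> MinMaxIndices(
    const std::vector<std::optional<int32_t>>& indices,
    const std::vector<int32_t>& rank) {
  std::optional<std::pair<int32_t, int32_t>> result;
  for (const auto& idx : indices) {
    if (!idx.has_value()) continue;  // nulls do not contribute to min/max
    assert(*idx >= 0 && static_cast<std::size_t>(*idx) < rank.size());
    if (!result.has_value()) {
      result = std::make_pair(*idx, *idx);
      continue;
    }
    if (rank[*idx] < rank[result->first]) result->first = *idx;
    if (rank[*idx] > rank[result->second]) result->second = *idx;
  }
  return result;
}
```

   The returned pair holds dictionary indices, so the caller would look up the 
two values in the dictionary once at the end. Unreferenced dictionary entries 
never affect the result because only indices that actually occur are visited.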




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

