emkornfield commented on a change in pull request #10729:
URL: https://github.com/apache/arrow/pull/10729#discussion_r670963422
##########
File path: cpp/src/parquet/statistics.cc
##########
@@ -567,6 +568,27 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
SetMinMaxPair(comparator_->GetMinMax(values));
}
+ void UpdateArrowDictionary(const ::arrow::Array& indices,
+ const ::arrow::Array& dictionary) {
+ IncrementNullCount(indices.null_count());
+ IncrementNumValues(indices.length() - indices.null_count());
+
+ if (indices.null_count() == indices.length()) {
+ return;
+ }
+
+ ::arrow::compute::ExecContext ctx(pool_);
+ PARQUET_ASSIGN_OR_THROW(auto referenced_indices,
+ ::arrow::compute::Unique(indices, &ctx));
+ PARQUET_ASSIGN_OR_THROW(
Review comment:
Does this allocate a whole new array? Maybe we should file a follow-up JIRA
to make this more efficient.
In particular, it seems like it could be more efficient to build a map from
dictionary index to sort order (I think we already have a kernel for this),
recomputing it before calculating the statistics only when there are new
dictionary entries, and then iterate through the indices, doing the
comparison in index space.
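
To make the suggestion concrete, here is a minimal standalone sketch of the idea. It does not use Arrow types or kernels: `DictionaryRanks` and `MinMaxDictionaryIndices` are hypothetical names, the dictionary is modeled as a `std::vector<std::string>`, and null indices are modeled as `std::nullopt`. The rank table would only need rebuilding when the dictionary gains new entries.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: rank[i] is the position of dictionary[i] in sorted
// order. Only needs recomputing when the dictionary changes.
std::vector<int32_t> DictionaryRanks(const std::vector<std::string>& dictionary) {
  std::vector<int32_t> order(dictionary.size());
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(), [&](int32_t a, int32_t b) {
    return dictionary[a] < dictionary[b];
  });
  std::vector<int32_t> rank(dictionary.size());
  for (int32_t pos = 0; pos < static_cast<int32_t>(order.size()); ++pos) {
    rank[order[pos]] = pos;
  }
  return rank;
}

// Hypothetical helper: returns the dictionary indices of the minimum and
// maximum referenced values, comparing integer ranks instead of the values
// themselves. Returns nullopt when every index is null.
std::optional<std::pair<int32_t, int32_t>> MinMaxDictionaryIndices(
    const std::vector<std::optional<int32_t>>& indices,
    const std::vector<int32_t>& rank) {
  std::optional<std::pair<int32_t, int32_t>> result;
  for (const auto& idx : indices) {
    if (!idx.has_value()) continue;  // skip nulls
    if (!result.has_value()) {
      result = std::make_pair(*idx, *idx);
      continue;
    }
    if (rank[*idx] < rank[result->first]) result->first = *idx;
    if (rank[*idx] > rank[result->second]) result->second = *idx;
  }
  return result;
}
```

With this shape, the per-batch work is a single pass over the indices with integer comparisons, and no intermediate array of referenced values is materialized. Unreferenced dictionary entries never affect the result because only indices that actually occur are visited.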