Rich-T-kid commented on code in PR #21589:
URL: https://github.com/apache/datafusion/pull/21589#discussion_r3096497882


##########
datafusion/physical-plan/src/aggregates/group_values/single_group_by/dictionary.rs:
##########
@@ -0,0 +1,1488 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::aggregates::group_values::GroupValues;
+use crate::hash_utils::RandomState;
+use arrow::array::{
+    Array, ArrayRef, BinaryArray, BinaryBuilder, BinaryViewArray, 
BinaryViewBuilder,
+    DictionaryArray, LargeBinaryArray, LargeBinaryBuilder, LargeStringArray,
+    LargeStringBuilder, PrimitiveArray, PrimitiveBuilder, StringArray, 
StringBuilder,
+    StringViewArray, StringViewBuilder, UInt64Array,
+};
+use arrow::datatypes::{ArrowDictionaryKeyType, ArrowNativeType, DataType};
+use datafusion_common::Result;
+use datafusion_common::hash_utils::create_hashes;
+use datafusion_expr::EmitTo;
+use std::collections::HashMap;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+pub struct GroupValuesDictionary<K: ArrowDictionaryKeyType + Send> {
+    /*
+    We know that every single &[ArrayRef] that is passed in is a dictionary 
array
+
+    self.inter() will be called across record batches, this means that
+    we cannot rely on a trivial approach where we just store the dictionary 
mapping as it is
+
+
+
+    Possible soluitions:
+    1A. store a hashmap that last across .intern() calls
+        | cast cols:&[ArrayRef] to generic Dictionary array, check if weve 
already stored its values (unique values) before
+        | if we have check the current mapping internally and update the 
groups array with the initial mapping for this value
+        | if it does not exist already (hashmap.size) is the group_id for this 
element
+    1B. how do we retrieve the dictionary encoded array this function expects?
+        | NOTE: emit returns one value per group not one value per row. The 
group values are the distinct values in the order they were first seen — not 
the full expanded key array [one per group index]
+        | keep a value_order array that stores unique elements the first time 
their seen, this maintains order for self.emit()
+        | the return type of the array self.emit() returns is based on the 
value type of the dictionary, may be smart to have an internal Group values 
that handles that logic
+        |
+
+    Possible optimizations (Ignore for now)
+    2A. dont rely directly in a hashmap we could hash all of the values at 
once and then as we iterate the keys array refer to them as the values are 
assumed to be smaller than the keys
+        | at the start of self.intern hash every value in the dictionary
+        | iterate through the keys section of dict_array
+            | for each key check its corresponding value and if it exist
+
+
+    */
+    // stores the order new unique elements are seen for self.emit()
+    seen_elements: Vec<Vec<u8>>, //  Box<dyn Builder> doesnt provide the 
flexibility of building partition arrays that wed need to support emit::First(N)

Review Comment:
   I think this is a good idea with the current approach. The only issue I can 
think of is in the future when other value types are supported. `ArrowBytesMap` 
only supports, 
       _Utf8,
       Utf8View,
       Binary,
       BinaryView,_
   in my revised PR I could abstract away how `seen_elements` are stored into a 
trait comprised of`store()` `retrive(N)`.This should make it easier for future 
work to integrate with `ArrowDictionaryKeyType` while also giving us a 
performance boost in the string/binary case



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to