jhorstmann commented on a change in pull request #382:
URL: https://github.com/apache/arrow-rs/pull/382#discussion_r644175597



##########
File path: arrow/src/array/transform/mod.rs
##########
@@ -338,33 +338,61 @@ fn preallocate_str_buffer<Offset: StringOffsetSizeTrait>(
     } else {
         buffer.push(0i32)
     }
-    let str_values_size = arrays
-        .iter()
-        .map(|data| {
-            // get the length of the value buffer
-            let buf_len = data.buffers()[1].len();
-            // find the offset of the buffer
-            // this returns a slice of offsets, starting from the offset of 
the array
-            // so we can take the first value
-            let offset = data.buffer::<Offset>(0)[0];
-            buf_len - offset.to_usize().unwrap()
-        })
-        .sum::<usize>();
 
     [
         buffer,
-        MutableBuffer::new(str_values_size * mem::size_of::<u8>()),
+        MutableBuffer::new(binary_size * mem::size_of::<u8>()),
     ]
 }
 
+/// Define capacities of child data or data buffers.
+#[derive(Debug, Clone)]
+pub enum Capacities {
+    /// Binary, Utf8 and LargeUtf8 data types
+    /// Define
+    /// * the capacity of the array offsets
+    /// * the capacity of the binary/ str buffer
+    Binary(usize, Option<usize>),
+    /// List and LargeList data types
+    /// Define
+    /// * the capacity of the array offsets
+    /// * the capacity of the child data
+    List(usize, Option<Box<Capacities>>),
+    /// Struct type
+    /// * the capacity of the array
+    /// * the capacities of the fields
+    Struct(usize, Option<Vec<Capacities>>),
+    /// Dictionary type
+    /// * the capacity of the array

Review comment:
       What is the difference here between the "capacity of the array" and "the 
capacity of the keys"?
   
   I have a rather special usage of concat for dictionary arrays, where I can 
guarantee that the dictionary values of all arrays are the same, or at least 
that the dictionary of the last array includes all previous dictionaries in the 
same order. I wonder whether that could be supported here and also fix the 
current concatenation logic that could lead to duplicated dictionary values. 
The rough idea would be to have another field here with an
   ```
   enum DictionaryMergeStrategy {
       Append(Option<usize>), // current logic, just append dictionary values 
with an optional capacity
       Merge(Option<usize>), // actually merge the dictionaries using a 
hashmap, to be imlemented in a followup
       UseLastValues, // use the values of the last dictionary array, might 
want to validate that this has a size at least as large as all other 
dictionaries so all keys are guaranteed to be in range
   }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to