alamb commented on code in PR #6005:
URL: https://github.com/apache/arrow-rs/pull/6005#discussion_r1667363866


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -199,6 +235,36 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
             return;
         }
 
+        // if string is long, we can optionally check duplication.

Review Comment:
   This comment seems incorrect -- this code checks deduplication if there is 
string tracking enabled, not if the string is long



##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -57,6 +58,7 @@ pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
     completed: Vec<Buffer>,
     in_progress: Vec<u8>,
     block_size: u32,
+    string_tracker: Option<(HashTable<usize>, ahash::RandomState)>, // map 
<string hash> -> <index to the views>

Review Comment:
   ```suggestion
       /// Some if deduplicating strings
       /// map `<string hash> -> <index to the views>`
       string_tracker: Option<(HashTable<usize>, ahash::RandomState)>
   ```



##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -83,6 +86,19 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
         Self { block_size, ..self }
     }
 
+    /// Deduplicate strings while building the array

Review Comment:
   Calling this `with_dedupliate_strings()` rather than `deduplicate_strings` 
might make it more obvious it is a builder style API, but that is just a 
preference
   
   I think putting a space between the description will make the docs render 
nicely (the first sentence / line is used for the summary display)
   
   💯  for explaining the tradeoff when using this option. 
   
   ```suggestion
       /// Deduplicate strings while building the array
       ///
   ```



##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -199,6 +235,36 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
             return;
         }
 
+        // if string is long, we can optionally check duplication.
+        let hash_val = match &self.string_tracker {
+            Some((ht, hasher)) => {
+                let hash_val = hasher.hash_one(v);
+
+                // we can not use ht.entry here because we are under immutable 
borrow.
+                let find = ht.find(hash_val, |idx| {
+                    let stored_value = self.get_value(*idx);
+                    v == stored_value
+                });
+
+                // If the string already exists, we will directly use the view
+                if let Some(idx) = find {
+                    self.views_builder
+                        .append(self.views_builder.as_slice()[*idx]);
+                    self.null_buffer_builder.append_non_null();
+                    return;
+                }
+                // return the hash value to be reused later.
+                Some(hash_val)
+            }
+            None => None,
+        };
+        let view_idx = self.views_builder.len();
+        if let Some(hash_val) = hash_val {

Review Comment:
   you might be abkle to avoid the unwrap by duplicating the `view_idx = 
self.views_builder.len()` in the if ... above. I think this way is fine too I 
just wanted to point it out



##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -199,6 +235,36 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
             return;
         }
 
+        // if string is long, we can optionally check duplication.
+        let hash_val = match &self.string_tracker {
+            Some((ht, hasher)) => {
+                let hash_val = hasher.hash_one(v);
+
+                // we can not use ht.entry here because we are under immutable 
borrow.

Review Comment:
   One pattern that can be used to use `ht.entry` if you want is something like
   
   ```rust
   if let Some(ht) = self.string_tracker.take() {
    ...
    // put the hash table back
    self.string_tracker = Some(ht)
   }
   ```
   
   



##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -357,6 +423,40 @@ mod tests {
     use super::*;
     use crate::Array;
 
+    #[test]
+    fn test_string_view_deduplicate() {
+        let mut builder = StringViewBuilder::new().deduplicate_strings();
+
+        let duplicated_value = "long string to test string view";
+        builder.append_value(duplicated_value);
+        builder.append_value("short");
+        builder.append_value(duplicated_value);
+        builder.append_null();
+        builder.append_value(duplicated_value);
+
+        let array = builder.finish_cloned();
+        array.to_data().validate_full().unwrap();
+        assert_eq!(array.data_buffers().len(), 1);

Review Comment:
   I think this test would pass even if the deduplicate strings was not enabled
   
   Perhaps you could add an assert that `array.data_buffers()[0].len() == 
duplicated_value.len()` only 🤔 



##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -357,6 +423,40 @@ mod tests {
     use super::*;
     use crate::Array;
 
+    #[test]
+    fn test_string_view_deduplicate() {

Review Comment:
   Can you please also add coverage when the duplicated string is in finished 
buffer (aka a buffer other than the current one)?
   
   Maybe that can be done by setting capacity to `duplicated_value.len() *2` or 
something



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to