thorfour opened a new issue, #36671:
URL: https://github.com/apache/arrow/issues/36671

   ### Describe the enhancement requested
   
   The memotable `GetOrInsert` function was implement with the `interface{}` to 
make it generic between builders. 
   
https://github.com/apache/arrow/blob/7690409568e8a4b51946f292b109075629ed1ee7/go/internal/hashing/xxh3_memo_table.go#L55
   
   However this empty interface conversion can come with a cost. We've noticed 
that when building large binary dictionary arrays that a large portion of CPU 
and memory allocations are spent converting the byte slice into `interface{}` 
   
   CPU:
   
![cpu_usage](https://github.com/apache/arrow/assets/8681572/8ab6785c-3606-4869-9dd2-a6dbbba05d78)
   
   Allocs:
   
![binary_dictionary_append](https://github.com/apache/arrow/assets/8681572/ab012ae5-55ad-4aef-82a6-f82a99b0ccbb)
   
   I copy/pastad a new memo table that only operates on `[]byte` to prove this 
theory; below is the profile with those changes.
   
   CPU:
   
![image](https://github.com/apache/arrow/assets/8681572/eb7bb691-a61d-4cff-a827-65a60b30dc43)
   
   Allocs:
   
![binary_dictionary_append_after](https://github.com/apache/arrow/assets/8681572/1537f753-3eaa-41aa-b358-09e0ae41087d)
   
   I see two straightforward ways to potentially address this. Use a `[]byte` 
specific memo table for `BinaryDictionaryBuilder` since it's going to be the 
only builder that suffers from this type of allocation. Or to replace the 
memotables with generics 
   
   ```go
   type MemoTable[T any] interface {
   ...
       GetOrInsert(val T) (idx int, existed bool, err error) 
   ...
   }
   ```
   
   ### Component(s)
   
   Go


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to