Re: [PR] Improve documentation on `StringArrayType` trait [datafusion]

via GitHub Sat, 17 Aug 2024 17:52:53 -0700


XiangpengHao commented on PR #12027:
URL: https://github.com/apache/datafusion/pull/12027#issuecomment-2295047337


   Thank you for the very detailed documentation! I think the enum-based 
approach indeed simplifies the interface.
   
   But I have two concerns:
   (1) every time we call a StringArrays function, e.g., `StringArrays::value`, 
we need to match the enum type which is an extra branch. This branch can be a 
performance bottleneck for high-performance/large volumes of data. This branch 
is required for every operation this enum implements and there's no easy way to 
work around it. Instead, the generic approach (the current one) makes this 
match early in the function call and won't need to recheck the data type.
   
   (2) The reason we have multiple types of string is to allow specialized 
implementation, for example, the [recent `substr` 
operation](https://github.com/apache/datafusion/pull/12044). In that case, we 
need to write different code for different string types. This means that we 
won't simply use `StringArrays::value` to get string value, but instead, we 
will first get the `view` of a string, then get the underlying buffer, 
sometimes we even want to know the buffer id.
   
   With that said, I do appreciate the thoughts you put into unifying the 
string representations, I 100% agree that we have too many strings to deal 
with. I hope that we can get `StringView` in DataFusion by default soon, and 
then probably simplify our implementation by using a subset of the string types.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Improve documentation on `StringArrayType` trait [datafusion]

Reply via email to