XiangpengHao commented on PR #12027: URL: https://github.com/apache/datafusion/pull/12027#issuecomment-2295047337
Thank you for the very detailed documentation! I think the enum-based approach indeed simplifies the interface. But I have two concerns: (1) every time we call a StringArrays function, e.g., `StringArrays::value`, we need to match the enum type which is an extra branch. This branch can be a performance bottleneck for high-performance/large volumes of data. This branch is required for every operation this enum implements and there's no easy way to work around it. Instead, the generic approach (the current one) makes this match early in the function call and won't need to recheck the data type. (2) The reason we have multiple types of string is to allow specialized implementation, for example, the [recent `substr` operation](https://github.com/apache/datafusion/pull/12044). In that case, we need to write different code for different string types. This means that we won't simply use `StringArrays::value` to get string value, but instead, we will first get the `view` of a string, then get the underlying buffer, sometimes we even want to know the buffer id. With that said, I do appreciate the thoughts you put into unifying the string representations, I 100% agree that we have too many strings to deal with. I hope that we can get `StringView` in DataFusion by default soon, and then probably simplify our implementation by using a subset of the string types. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org