zhuqi-lucas commented on issue #7350: URL: https://github.com/apache/arrow-rs/issues/7350#issuecomment-2767941953
Thank you @XiangpengHao @alamb. I was thinking of supporting a longer inline prefix for StringView comparisons, but it looks like the prefix is always fixed at 4 bytes, so we can't change it easily.

> > we add a new ByteView to support an 8-byte prefix
>
> I think Arrow spec says we need to do 4 bytes prefix: https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
>
> As you have pointed out, StringViewArray is not always better than StringArray, especially when the prefixes are the same.
>
> But I do believe there are micro-architecture level optimizations we can do to improve performance, like better compiler hint, prefetching, gc tuning etc.
>
> Another direction is probably to rewrite the FilterExec/CoalesenceExec to emit StringArray rather than StringViewArray, the idea is to use StringView in lower levels of the plan and use String in higher levels of the plan

I agree. The linked PR uses GC as a workaround for the sort-merge compare cases.

> I do think theoretically StringArray is likely to be faster than StringViewArray for larger strings in many cases as it is more efficient (it has fewer indirections)
>
> > Another direction is probably to rewrite the FilterExec/CoalesenceExec to emit StringArray rather than StringViewArray, the idea is to use StringView in lower levels of the plan and use String in higher levels of the plan
>
> that is a very interesting idea 🤔

The FilterExec/CoalesenceExec idea is interesting: GC is currently used there to reduce the overhead of FilterExec/CoalesenceExec. Maybe we can try rewriting FilterExec/CoalesenceExec to emit StringArray and compare the gains and losses.
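
To make the prefix point concrete, here is a minimal sketch in plain Rust of why the fixed 4-byte prefix helps only when prefixes differ. It does not use the arrow-rs API: `View`, `make_view`, `bytes_of` and `compare` are hypothetical names for illustration. The field order follows the view layout in the Arrow spec for strings longer than 12 bytes (length, 4-byte prefix, buffer index, offset), and for simplicity every string is stored in a data buffer, whereas the real format inlines strings of 12 bytes or fewer.

```rust
use std::cmp::Ordering;

/// Simplified, hypothetical model of one 16-byte view (not the arrow-rs type).
#[derive(Clone, Copy)]
struct View {
    len: u32,          // string length in bytes
    prefix: [u8; 4],   // first 4 bytes, zero-padded if the string is shorter
    buffer_index: u32, // which data buffer holds the full string
    offset: u32,       // start of the string inside that buffer
}

/// Appends `s` to the last data buffer and returns a view pointing at it.
fn make_view(buffers: &mut Vec<Vec<u8>>, s: &str) -> View {
    if buffers.is_empty() {
        buffers.push(Vec::new());
    }
    let buffer_index = buffers.len() - 1;
    let data = &mut buffers[buffer_index];

    let bytes = s.as_bytes();
    let mut prefix = [0u8; 4];
    let n = bytes.len().min(4);
    prefix[..n].copy_from_slice(&bytes[..n]);

    let offset = data.len() as u32;
    data.extend_from_slice(bytes);

    View {
        len: bytes.len() as u32,
        prefix,
        buffer_index: buffer_index as u32,
        offset,
    }
}

/// Follows the indirection from a view to the full string bytes.
fn bytes_of<'a>(buffers: &'a [Vec<u8>], v: &View) -> &'a [u8] {
    let start = v.offset as usize;
    let end = start + v.len as usize;
    &buffers[v.buffer_index as usize][start..end]
}

/// Compares two views. The returned bool reports whether the data
/// buffers had to be read, i.e. whether the prefixes were identical.
fn compare(buffers: &[Vec<u8>], a: &View, b: &View) -> (Ordering, bool) {
    match a.prefix.cmp(&b.prefix) {
        // Same prefix: the view alone cannot decide, so read the buffers.
        Ordering::Equal => (bytes_of(buffers, a).cmp(bytes_of(buffers, b)), true),
        // Different prefix: decided from the 16-byte views only.
        decided => (decided, false),
    }
}
```

With this sketch, comparing `"apple pie recipe"` with `"banana bread recipe"` is decided by the prefixes `appl` and `bana` alone, while two URLs that both start with `https://example.com/` share the prefix `http` and fall back to reading the data buffer. That fallback is the extra indirection the quoted comments refer to.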
