pitrou commented on issue #46128: URL: https://github.com/apache/arrow/issues/46128#issuecomment-2820577774
> The above logic prevents casting scenarios, such as 8 GB of data in a LargeString type (where each element's length is less than the maximum int32_t), to a StringView. I believe a simple implementation could be feasible, though I recognize it would overlook the key benefit of StringView types, which is to create a view of a data buffer while avoiding large data copies.

A better implementation would be to keep the original buffer but slice it in multiple view buffers. This is exactly what the comment suggests:

```c++
// A more complicated loop could work by slicing the data buffer into
// more than one variadic buffer, but this is probably overkill for now
// before someone hits this problem in practice.
```

> As mentioned in [1] (Section 4), a key feature of the String/Binary View is preventing duplication. I believe this approach could be useful when converting from Fixed to String/Binary View types to minimize memory usage.

Well, as your linked article says, _"this process is expensive and involves hashing every string and maintaining a hash table, and so it cannot be done by default when creating a StringViewArray"_. So this should probably be a dedicated compute function.

> The implementation above may lead to a state where it owns a buffer but utilizes only a portion of it. Is it worthwhile to create a garbage collector and transfer the data to a new buffer?

Similarly, consolidating string view buffers (with tunable heuristics) could be a dedicated compute function.
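
For illustration only, the "more complicated loop" mentioned in the quoted code comment could look roughly like the standalone sketch below. It is not the Arrow implementation and does not use the Arrow API: `Slice`, `ViewLocation`, `SlicedViews`, `SliceIntoViewBuffers` and the `max_slice_size` parameter are hypothetical names introduced here. It assumes the 64-bit offsets of a LargeString-style array are non-decreasing, and it ignores the fact that short strings would be stored inline in the view rather than referencing a buffer.

```c++
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical helper types, not part of the Arrow API.
struct Slice {
  int64_t start;   // byte position of the slice in the original data buffer
  int64_t length;  // slice size in bytes
};

struct ViewLocation {
  int32_t buffer_index;  // which slice (variadic buffer) holds the string
  int32_t offset;        // byte offset of the string inside that slice
  int32_t length;        // string length in bytes
};

struct SlicedViews {
  std::vector<Slice> slices;
  std::vector<ViewLocation> views;
};

// Greedily cut the data buffer into zero-copy slices of at most
// `max_slice_size` bytes, never splitting a string across two slices.
// `offsets` has one more entry than there are strings.
SlicedViews SliceIntoViewBuffers(
    const std::vector<int64_t>& offsets,
    int64_t max_slice_size = std::numeric_limits<int32_t>::max()) {
  if (max_slice_size <= 0 ||
      max_slice_size > std::numeric_limits<int32_t>::max()) {
    throw std::invalid_argument("max_slice_size must fit in int32_t");
  }
  SlicedViews out;
  if (offsets.size() < 2) return out;  // no strings, nothing to slice

  int64_t slice_start = offsets.front();
  int64_t slice_end = slice_start;
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    const int64_t begin = offsets[i];
    const int64_t end = offsets[i + 1];
    const int64_t length = end - begin;
    if (length < 0 || length > max_slice_size) {
      throw std::invalid_argument("string does not fit in a single slice");
    }
    if (end - slice_start > max_slice_size) {
      // The current slice is full: close it and start a new one here.
      out.slices.push_back({slice_start, slice_end - slice_start});
      slice_start = begin;
    }
    slice_end = end;
    // The number of closed slices is the index of the slice being filled.
    out.views.push_back({static_cast<int32_t>(out.slices.size()),
                         static_cast<int32_t>(begin - slice_start),
                         static_cast<int32_t>(length)});
  }
  out.slices.push_back({slice_start, slice_end - slice_start});
  return out;
}
```

Each `Slice` would correspond to one variadic buffer that merely references a region of the original data buffer, so no string bytes are copied; only the per-element views need to be written. Whether such sliced buffers should later be consolidated is the separate question raised at the end of the comment.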