pitrou commented on issue #46128:
URL: https://github.com/apache/arrow/issues/46128#issuecomment-2820577774

   > The above logic prevents casting scenarios, such as 8 GB of data in a 
LargeString type (where each element's length is less than the maximum 
int32_t), to a StringView. I believe a simple implementation could be feasible, 
though I recognize it would overlook the key benefit of StringView types, which 
is to create a view of a data buffer while avoiding large data copies.
   
   A better implementation would be to keep the original data buffer but slice it 
into multiple variadic data buffers. This is exactly what the comment suggests:
   ```c++
          // A more complicated loop could work by slicing the data buffer into
          // more than one variadic buffer, but this is probably overkill for now
          // before someone hits this problem in practice.
   ```
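
As a rough sketch of what that slicing loop might look like, the snippet below plans slice boundaries from LargeString-style offsets so that no element straddles two slices and every offset within a slice fits in an `int32_t`. It uses only the standard library; `SlicePlan` and `PlanSlices` are hypothetical names for illustration, not Arrow C++ APIs, and it assumes each element is at most `max_slice_bytes` long.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper type for illustration; not part of the Arrow C++ API.
struct SlicePlan {
  std::vector<int64_t> slice_starts;     // start byte of each slice in the original buffer
  std::vector<int32_t> buffer_index;     // per element: which slice holds its bytes
  std::vector<int32_t> offset_in_slice;  // per element: offset relative to that slice
};

// `offsets` has n + 1 entries, as in the LargeString layout.
// Assumption: every element is at most `max_slice_bytes` long.
SlicePlan PlanSlices(const std::vector<int64_t>& offsets,
                     int64_t max_slice_bytes = std::numeric_limits<int32_t>::max()) {
  SlicePlan plan;
  if (offsets.size() < 2) return plan;  // no elements, nothing to slice
  int64_t slice_start = offsets[0];
  plan.slice_starts.push_back(slice_start);
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    const int64_t begin = offsets[i];
    const int64_t end = offsets[i + 1];
    assert(end - begin <= max_slice_bytes);
    // Open a new slice when this element would not fit entirely in the current one.
    if (end - slice_start > max_slice_bytes) {
      slice_start = begin;
      plan.slice_starts.push_back(slice_start);
    }
    plan.buffer_index.push_back(static_cast<int32_t>(plan.slice_starts.size() - 1));
    plan.offset_in_slice.push_back(static_cast<int32_t>(begin - slice_start));
  }
  return plan;
}
```

Each slice would then be exposed as a zero-copy slice of the original buffer, so the string data itself is never copied. Only the views need to be written.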
   
   > As mentioned in [1] (Section 4), a key feature of the String/Binary View 
is preventing duplication. I believe this approach could be useful when 
converting from Fixed to String/Binary View types to minimize memory usage.
   
   Well, as your linked article says, _"this process is expensive and involves 
hashing every string and maintaining a hash table, and so it cannot be done by 
default when creating a StringViewArray"_. So this should probably be a 
dedicated compute function.
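
To make the cost concrete, here is a minimal standard-library sketch of the hashing approach the article describes: every value is hashed and looked up in a table, and only the first occurrence is appended to the data. `DedupResult` and `DeduplicateValues` are hypothetical names, not existing Arrow functions, and the flat offset/length pairs stand in for real string views.

```c++
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical result type for illustration; not an Arrow C++ class.
struct DedupResult {
  std::string data;              // every distinct value stored exactly once
  std::vector<int32_t> offsets;  // per element: start of its bytes in `data`
  std::vector<int32_t> lengths;  // per element: byte length
};

DedupResult DeduplicateValues(const std::vector<std::string_view>& values) {
  DedupResult out;
  // Keys point into the caller's input, which outlives this function.
  std::unordered_map<std::string_view, int32_t> first_offset;
  for (std::string_view v : values) {
    // One hash and one table probe per element: this is the expensive part.
    auto [it, inserted] =
        first_offset.try_emplace(v, static_cast<int32_t>(out.data.size()));
    if (inserted) out.data.append(v);  // first occurrence: store the bytes
    out.offsets.push_back(it->second);
    out.lengths.push_back(static_cast<int32_t>(v.size()));
  }
  return out;
}
```

The hash table grows with the number of distinct values, which is why this seems better suited to an explicit, opt-in function than to the default cast path.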
   
   > The implementation above may lead to a state where it owns a buffer but 
utilizes only a portion of it. Is it worthwhile to create a garbage collector 
and transfer the data to a new buffer?
   
   Similarly, consolidating string view buffers (with tunable heuristics) could 
be a dedicated compute function.
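
One possible heuristic, sketched with the standard library only: compact a data buffer when the bytes still referenced by views fall below a tunable fraction of its size. `SimpleView`, `CompactIfSparse` and the `min_utilization` default of 0.5 are assumptions made for illustration, not Arrow APIs, and the sketch assumes views do not share or overlap bytes.

```c++
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplified view: a range inside a single data buffer.
struct SimpleView {
  int32_t offset;
  int32_t length;
};

// Returns true if the buffer was compacted and the views were rewritten.
// Assumption: views do not overlap, so summing lengths gives the bytes in use.
bool CompactIfSparse(std::string& buffer, std::vector<SimpleView>& views,
                     double min_utilization = 0.5) {
  int64_t referenced = 0;
  for (const SimpleView& v : views) referenced += v.length;
  if (buffer.empty() ||
      static_cast<double>(referenced) >=
          min_utilization * static_cast<double>(buffer.size())) {
    return false;  // dense enough: keep the existing buffer, copy nothing
  }
  std::string compacted;
  compacted.reserve(static_cast<std::size_t>(referenced));
  for (SimpleView& v : views) {
    const int32_t new_offset = static_cast<int32_t>(compacted.size());
    compacted.append(buffer, static_cast<std::size_t>(v.offset),
                     static_cast<std::size_t>(v.length));
    v.offset = new_offset;  // point the view at the compacted copy
  }
  buffer = std::move(compacted);
  return true;
}
```

The threshold trades copy cost against retained memory, which is the kind of knob a dedicated function could expose.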

