scott-routledge2 commented on PR #43661:
URL: https://github.com/apache/arrow/pull/43661#issuecomment-3518523113

   > Ok, I think this is approaching the problem from the wrong end.
   > 
   >     1. We should fix the issue in `CastBinaryToBinaryOffsets` so that the O(N^2) behavior cannot occur anymore
   > 
   >     2. Also, we are now able to read Parquet [directly as LargeString](https://github.com/apache/arrow/pull/46532), so the intermediate cast is not required anymore
   
   I would be happy to open a separate issue to look at `CastBinaryToBinaryOffsets`; however, I wasn't sure how to make this cast more efficient without changing the API significantly. A single cast of a slice is O(offset + length), since a new string-offsets buffer has to be built covering everything from the start of the original array through the end of the slice. Casting all the slices of a size-N array therefore only becomes O(N^2) when the slice length is much smaller than N, so to me the most straightforward fix is to change how the cast is invoked, i.e. cast the entire array once rather than each slice separately.
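
   To illustrate, here is a minimal C++ sketch (hypothetical data and slice size, and assuming the per-slice cast cost described above) contrasting casting each slice with casting the whole array once and slicing the result:

   ```cpp
   #include <arrow/api.h>
   #include <arrow/compute/api.h>

   #include <memory>
   #include <vector>

   // Variant 1: cast each slice separately. Per the cost described above, each
   // cast pays O(offset + slice_len), so the total over all slices of a size-N
   // array approaches O(N^2) when slice_len << N.
   arrow::Result<std::vector<std::shared_ptr<arrow::Array>>> CastPerSlice(
       const std::shared_ptr<arrow::Array>& strings, int64_t slice_len) {
     std::vector<std::shared_ptr<arrow::Array>> out;
     for (int64_t offset = 0; offset < strings->length(); offset += slice_len) {
       ARROW_ASSIGN_OR_RAISE(
           arrow::Datum casted,
           arrow::compute::Cast(strings->Slice(offset, slice_len),
                                arrow::large_utf8()));
       out.push_back(casted.make_array());
     }
     return out;
   }

   // Variant 2: cast the whole array once (O(N)), then take zero-copy slices
   // of the already-cast result.
   arrow::Result<std::vector<std::shared_ptr<arrow::Array>>> CastOnceThenSlice(
       const std::shared_ptr<arrow::Array>& strings, int64_t slice_len) {
     ARROW_ASSIGN_OR_RAISE(arrow::Datum casted,
                           arrow::compute::Cast(strings, arrow::large_utf8()));
     std::shared_ptr<arrow::Array> large = casted.make_array();
     std::vector<std::shared_ptr<arrow::Array>> out;
     for (int64_t offset = 0; offset < large->length(); offset += slice_len) {
       out.push_back(large->Slice(offset, slice_len));
     }
     return out;
   }
   ```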
   
   For the second point, the cast here was not a workaround for reading LargeString data. Rather, we were working with datasets that have a single unified schema but whose individual files may have different schemas (e.g. string vs large_string), so the cast would still be necessary at some point to reconcile the schema differences. This cast is currently handled in [MakeExecBatch](https://github.com/apache/arrow/blob/6a2e19a852b367c72d7b12da4d104456491ed8b7/cpp/src/arrow/compute/expression.cc#L672), but there is a comment there saying this should be done by the reader, so I believe this PR would be an incremental step toward that goal. I was wondering what your thoughts would be?
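
   For concreteness, here is a rough sketch of that scenario (hypothetical file paths and column name, and assuming the scanner reconciles the per-file types against the dataset schema via the cast described above): one file stores the column as string, another as large_string, and both are scanned under a single unified large_string schema, so the narrower column has to be cast somewhere between the reader and the scan output.

   ```cpp
   #include <arrow/api.h>
   #include <arrow/dataset/api.h>
   #include <arrow/dataset/file_parquet.h>
   #include <arrow/filesystem/localfs.h>

   #include <memory>

   arrow::Result<std::shared_ptr<arrow::Table>> ScanUnified() {
     auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
     auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();

     // Two hypothetical files whose "s" column differs in width:
     // one stores string, the other large_string.
     arrow::dataset::FileSystemFactoryOptions options;
     ARROW_ASSIGN_OR_RAISE(
         auto factory,
         arrow::dataset::FileSystemDatasetFactory::Make(
             fs, {"/data/f1.parquet", "/data/f2.parquet"}, format, options));

     // Scan under a single unified schema; the file whose column is plain
     // string has to be cast to large_string during the scan.
     auto unified = arrow::schema({arrow::field("s", arrow::large_utf8())});
     ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish(unified));

     ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
     ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
     return scanner->ToTable();
   }
   ```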

