scott-routledge2 commented on PR #43661:
URL: https://github.com/apache/arrow/pull/43661#issuecomment-3518523113
> Ok, I think this is approaching the problem from the wrong end.
>
> 1. We should fix the issue in `CastBinaryToBinaryOffsets` so that the O(N^2) behavior cannot occur anymore
>
> 2. Also, we are now able to read Parquet [directly as LargeString](https://github.com/apache/arrow/pull/46532), so the intermediate cast is not required anymore

I would be happy to open a separate issue to look at `CastBinaryToBinaryOffsets`, but I wasn't sure how to make this cast more efficient without changing the API significantly. A single cast of a slice is O(offset + length), since a new string-offsets buffer needs to be created with the same shape as the original slice. Casting slices of a size-N array only becomes O(N^2) when the slice lengths are much smaller than N, so to me the most straightforward approach is to change how the cast is invoked, i.e. casting the entire array at once rather than casting each slice.

On the second point, the cast here was not a workaround for reading LargeString data. Rather, we are working with datasets that have a single, unified schema but whose individual files may have different schemas (e.g. string vs large_string), so the cast would still be necessary at some point to reconcile the schema differences. This cast is currently handled in [MakeExecBatch](https://github.com/apache/arrow/blob/6a2e19a852b367c72d7b12da4d104456491ed8b7/cpp/src/arrow/compute/expression.cc#L672), but there is a comment there saying it should be done by the reader, so I believe this PR would be an incremental step toward that goal. I was wondering what your thoughts would be?
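To make the complexity argument concrete, here is a minimal, hypothetical sketch (not code from this PR) using the Arrow C++ `arrow::compute::Cast` API. It contrasts one whole-array cast to `large_utf8` with casting the same array slice by slice; the array size and slice length are arbitrary, and the comments restate the cost claims from the discussion above rather than the exact internals of `CastBinaryToBinaryOffsets`.

```cpp
// Illustrative sketch only: whole-array cast vs. per-slice casts to large_string.
#include <arrow/api.h>
#include <arrow/compute/api.h>

#include <iostream>

arrow::Status RunExample() {
  // Build a string (32-bit offset) array with N short values.
  const int64_t n = 100000;
  arrow::StringBuilder builder;
  for (int64_t i = 0; i < n; ++i) {
    ARROW_RETURN_NOT_OK(builder.Append("value"));
  }
  std::shared_ptr<arrow::Array> array;
  ARROW_RETURN_NOT_OK(builder.Finish(&array));

  // Casting the whole array to large_string is a single pass over the offsets: O(N).
  ARROW_ASSIGN_OR_RAISE(auto whole,
                        arrow::compute::Cast(*array, arrow::large_utf8()));
  std::cout << "whole-array cast -> " << whole->type()->ToString() << std::endl;

  // Casting slice by slice: per the description above, each Cast does
  // O(offset + length) work to build new offsets, so the total cost approaches
  // O(N^2) when slice_len << N.
  const int64_t slice_len = 10;
  int64_t total = 0;
  for (int64_t off = 0; off + slice_len <= n; off += slice_len) {
    ARROW_ASSIGN_OR_RAISE(auto piece,
                          arrow::compute::Cast(*array->Slice(off, slice_len),
                                               arrow::large_utf8()));
    total += piece->length();
  }
  std::cout << "per-slice casts covered " << total << " values" << std::endl;
  return arrow::Status::OK();
}

int main() {
  auto st = RunExample();
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  return 0;
}
```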
