scott-routledge2 commented on PR #43661: URL: https://github.com/apache/arrow/pull/43661#issuecomment-3549132175
> Could you clarify how the `O(N^2)` complexity arises? I read the issue description but couldn’t quite follow it - probably because I’m not very familiar with the dataset code - so a short walkthrough would be really helpful.

@zanmato1984 of course! It's not so much an issue with datasets as it is the actual casting kernel in compute, as @pitrou pointed out.

When casting small slices of size L of a large array of length N, casting the first slice requires allocating a new offsets buffer of size L+1 for the up/downcast int offsets. The second slice starts at offset L, and [because of the way casting is implemented](https://github.com/bodo-ai/arrow-bodo-fork/blob/5eaf553bfc7aa639fd67bd622b6b808e71fbba39/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L294), the output of the cast inherits the offset from the input slice, so we actually need to allocate a buffer of size 2L+1 and zero out its first L elements. Continuing this pattern, the k-th slice costs roughly kL, so the total work is roughly L + 2L + 3L + ... + N ~ N^2/(2L), which is `O(N^2)` if L is constant.

Where this relates to datasets: when reading a Parquet file, ScanAsyncBatches essentially reads one row group at a time and then produces slices on top of that row group, which then need to be cast in MakeExecBatch, following the pattern described above (see the sketch below).
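To make the pattern concrete, here is a minimal sketch in Python/pyarrow (not from the PR; `N` and `L` are arbitrary hypothetical values) that casts fixed-size slices of one string array, which should go through the same scalar_cast_string kernel linked above:

```python
import pyarrow as pa

# Hypothetical sizes chosen only to illustrate the pattern.
N = 200_000   # total array length
L = 1_000     # slice / batch size

arr = pa.array(["x"] * N, type=pa.string())

for start in range(0, N, L):
    batch = arr.slice(start, L)
    # Each cast builds a new int64 offsets buffer. Because the output
    # inherits the slice's offset, the buffer for the slice starting at
    # `start` covers start + L + 1 offsets (with the first `start`
    # entries zeroed), so per-slice work grows with `start` and the
    # total over all slices is ~ N^2 / (2L).
    batch.cast(pa.large_string())
```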
