scott-routledge2 commented on PR #43661:
URL: https://github.com/apache/arrow/pull/43661#issuecomment-3549132175

   > Could you clarify how the `O(N^2)` complexity arises? I read the issue 
description but couldn’t quite follow it - probably because I’m not very 
familiar with the dataset code - so a short walkthrough would be really helpful.
   
   @zanmato1984 of course! It's not so much an issue with datasets as it is the actual casting kernel in compute, as @pitrou pointed out.
   
   When casting small slices of size L of a large array of length N: to cast the first slice, we have to allocate a new offsets buffer of size L+1 for the up/downcasted int offsets. The second slice has offset L, and [because of the way casting is implemented](https://github.com/bodo-ai/arrow-bodo-fork/blob/5eaf553bfc7aa639fd67bd622b6b808e71fbba39/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L294), we actually need to allocate a buffer of size 2L+1 and zero out the first L elements, since the output of the cast inherits the offset from the input slice. Continuing this pattern, the total work is roughly L + 2L + 3L + ... + N ~ N^2/(2L), i.e. O(N^2) when L is constant.
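
   Here's a minimal pyarrow sketch of that pattern (the array contents, N, and L are just illustrative, not from the original report); if the cast behaves as described, the offsets buffer allocated for each successive slice grows linearly with the slice's offset:

```python
import pyarrow as pa

N = 100_000   # total array length (illustrative)
L = 10_000    # slice length (illustrative)

arr = pa.array(["x"] * N, type=pa.string())

for start in range(0, N, L):
    sliced = arr.slice(start, L)          # zero-copy slice with offset = start
    out = sliced.cast(pa.large_string())  # int32 -> int64 offset upcast
    offsets_buf = out.buffers()[1]        # buffers: [validity, offsets, data]
    # If the cast inherits the input offset as described above, this size
    # grows roughly as 8 * (start + L + 1), i.e. linearly in start.
    print(start, offsets_buf.size)
```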
   
   Where this relates to datasets: when reading a parquet file, ScanAsyncBatches essentially reads one row group at a time and then produces slices on top of that row group, which then need to be cast in MakeExecBatch, following the pattern described above.
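
   For a concrete picture, here's one hypothetical way to exercise that path from Python (file name, sizes, and types are made up for illustration), assuming the scanner casts each produced batch to the dataset schema when it differs from the file's physical schema:

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# One parquet file containing a single large string row group.
pq.write_table(pa.table({"s": ["x"] * 1_000_000}), "big.parquet",
               row_group_size=1_000_000)

# Dataset schema declares the column as large_string, unlike the file.
# Assuming each batch gets cast to this schema (the MakeExecBatch step
# described above), every batch is a small offset slice of the row group
# and hits the cast pattern shown earlier.
dataset = ds.dataset(
    "big.parquet",
    format="parquet",
    schema=pa.schema([pa.field("s", pa.large_string())]),
)
for batch in dataset.scanner(batch_size=10_000).to_batches():
    pass  # each batch's cast allocates a progressively larger offsets buffer
```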
   

