jorgecarleitao commented on issue #461: URL: https://github.com/apache/arrow-rs/issues/461#issuecomment-877808997
Let me try to explain my current reasoning.

All methods exposed on `Array` are `O(1)`. In particular, `.slice` is `O(1)` over the array, and thus `O(c)` over the record, where `c` is the number of fields.

`concat` over `RecordBatch` seems rather simple, but it is `O(c * n * r)`, where `c` is the number of columns, `r` the number of records, and `n` the average length of the records. Since the work over `c` is trivially parallelizable, I would say that the natural implementation is to actually rayon it, i.e. `columns().par_iter()...`. Generally, I consider non-parallel iteration over a record to be an anti-pattern, since parallelism over columns is one of the hallmarks of columnar formats.

Imo the decision of how to iterate over columns does not belong to `arrow-rs`, but to Polars, DataFusion and the like. DataFusion uses `iter`; Polars uses `par_iter` for the most part.

We do have some methods in compute that `iter` over the `RecordBatch` and follow this pattern (`filter` and `sort`, I believe). So, in this context, I would be more inclined to place `concat_record` at the same level as them: methods that are not `O(1)` over the arrays' length, which some may use when they do not want to commit to a multi-threaded execution. But again, imo this is an anti-pattern that we should not promote, as it enforces a specific threading model over columns.

The reasoning for having it in `compute` is to not drag compute dependencies into the core modules (I see them as being the datatypes, array, buffer, RecordBatch, alloc). `compute` has a massive compile time compared to the rest of the crate, and keeping these separated makes it easier to split `arrow` into two crates (`arrow` and `arrow-compute`) to reduce compile time and/or binary size. This is minor and can be solved by moving `impl RecordBatch` to `compute::kernels::concat` if the time comes.
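The point about who picks the threading model can be illustrated with a minimal, std-only sketch. It does not use `arrow` or rayon: `Batch`, `concat_column`, `concat_serial`, and `concat_parallel` are hypothetical stand-ins invented for this example, and scoped threads stand in for rayon's `par_iter`. The per-column work is the same in both variants; only the iteration strategy over columns differs.

```rust
use std::thread;

/// Hypothetical stand-in for a record batch: one `Vec<i64>` per column.
/// This is NOT arrow's `RecordBatch`; it only models the columnar layout.
type Batch = Vec<Vec<i64>>;

/// Concatenate column `col` across all batches.
/// Every value is copied once, so this is O(n * r) for one column.
fn concat_column(batches: &[Batch], col: usize) -> Vec<i64> {
    let total: usize = batches.iter().map(|b| b[col].len()).sum();
    let mut out = Vec::with_capacity(total);
    for batch in batches {
        out.extend_from_slice(&batch[col]);
    }
    out
}

/// Serial variant: the library fixes the threading model (a single thread),
/// giving O(c * n * r) work on that thread.
fn concat_serial(batches: &[Batch], num_columns: usize) -> Batch {
    (0..num_columns).map(|c| concat_column(batches, c)).collect()
}

/// Parallel variant: one scoped thread per column (an assumption made for
/// brevity; a real engine would more likely use a thread pool).
fn concat_parallel(batches: &[Batch], num_columns: usize) -> Batch {
    thread::scope(|s| {
        // Spawn all column tasks first, then join them in column order.
        let handles: Vec<_> = (0..num_columns)
            .map(|c| s.spawn(move || concat_column(batches, c)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

Both functions return the same result for the same input. Exposing only the per-column kernel (`concat_column` here) leaves the choice between the two loops to the caller, which is the argument made above.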
