[GitHub] [arrow-rs] jorgecarleitao commented on issue #461: Implement RecordBatch::concat

GitBox Sun, 11 Jul 2021 07:25:39 -0700


jorgecarleitao commented on issue #461:
URL: https://github.com/apache/arrow-rs/issues/461#issuecomment-877808997



   Let me try to explain my reasoning atm.
   
   All methods exposed on `Array` are `O(1)`. In particular, `.slice` is `O(1)` 
over the array, and thus `O(c)` over the record batch, where `c` is the number 
of fields.
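   
   A minimal sketch of why slicing is `O(1)` per array and `O(c)` per batch. 
The `Column` type and `slice_batch` function below are hypothetical stand-ins 
(a shared buffer plus an offset and length), not the actual arrow-rs types:

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an Arrow array: a shared buffer plus a window.
#[derive(Clone, Debug)]
struct Column {
    data: Arc<Vec<i64>>,
    offset: usize,
    len: usize,
}

impl Column {
    fn new(values: Vec<i64>) -> Self {
        let len = values.len();
        Column { data: Arc::new(values), offset: 0, len }
    }

    /// O(1): clones the `Arc` and adjusts the window; no element is copied.
    fn slice(&self, offset: usize, len: usize) -> Column {
        assert!(offset + len <= self.len, "slice out of bounds");
        Column {
            data: Arc::clone(&self.data),
            offset: self.offset + offset,
            len,
        }
    }

    /// The elements visible through the current window.
    fn values(&self) -> &[i64] {
        &self.data[self.offset..self.offset + self.len]
    }
}

/// O(c): one O(1) slice per column, independent of the number of rows.
fn slice_batch(columns: &[Column], offset: usize, len: usize) -> Vec<Column> {
    columns.iter().map(|c| c.slice(offset, len)).collect()
}
```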
   
   `concat` over `RecordBatch` seems rather simple but is `O(c * n * r)`, where 
`c` is the number of columns, `r` the number of record batches, and `n` the 
average length of the batches. Since the work over `c` is trivially 
parallelizable, I would say that the natural implementation is to actually 
rayon it, i.e. `columns().par_iter()...`.
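   
   To illustrate the cost and the per-column parallelism, here is a rough 
sketch using only the standard library. `Batch`, `concat_column`, 
`concat_batches_seq` and `concat_batches_par` are hypothetical names, plain 
`Vec<i64>` columns stand in for Arrow arrays, and scoped threads stand in for 
rayon. It assumes every batch has the same number of columns:

```rust
use std::thread;

/// Hypothetical stand-in for a record batch: a list of equally long columns.
type Batch = Vec<Vec<i64>>;

/// Concatenate column `col` across all batches: O(n * r) element copies.
fn concat_column(batches: &[Batch], col: usize) -> Vec<i64> {
    let total: usize = batches.iter().map(|b| b[col].len()).sum();
    let mut out = Vec::with_capacity(total);
    for b in batches {
        out.extend_from_slice(&b[col]);
    }
    out
}

/// Sequential version: O(c * n * r) on a single thread.
fn concat_batches_seq(batches: &[Batch]) -> Batch {
    let num_columns = batches.first().map_or(0, |b| b.len());
    (0..num_columns)
        .map(|col| concat_column(batches, col))
        .collect()
}

/// Parallel over columns: one scoped thread per column (assumption: a real
/// implementation would more likely use a thread pool such as rayon's).
fn concat_batches_par(batches: &[Batch]) -> Batch {
    let num_columns = batches.first().map_or(0, |b| b.len());
    thread::scope(|s| {
        // Spawn all column tasks first, then join them in column order.
        let handles: Vec<_> = (0..num_columns)
            .map(|col| s.spawn(move || concat_column(batches, col)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("column thread panicked"))
            .collect()
    })
}
```

   Both functions return the same result; the only difference is who decides 
the threading model, which is the design question at stake here.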
   
   Generally, I consider non-parallel iterations over a record batch to be an 
anti-pattern, since parallelism over columns is one of the hallmarks of 
columnar formats. Imo the decision of how to iterate over columns does not 
belong to `arrow-rs`, but to Polars, DataFusion and the like. DataFusion 
uses `iter`; Polars uses `par_iter` for the most part.
   
   We do have some methods in compute that `iter` over the `RecordBatch` that 
follow this pattern (`filter` and `sort` I believe). So, in this context, I 
would be more inclined to place `concat_record` at the same level as them: 
methods that are not `O(1)` over the arrays' length that some may use when they 
do not want to commit to a multi-threaded execution. But again, imo this is an 
anti-pattern that we should not promote, as it enforces a specific threading 
model over columns.
   
   The reasoning to have it in `compute` is to not drag compute dependencies 
into the core modules (I see them as being the datatypes, array, buffer, 
RecordBatch, alloc). The reason is that `compute` has a massive compile time 
when compared to the rest of the crate, and keeping these separated makes it 
easier to split `arrow` in two crates (`arrow` and `arrow-compute`) to reduce 
compile time and/or binary size. This is minor and can be solved by moving 
`impl RecordBatch` to `compute::kernels::concat` if the need arises.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
