[ https://issues.apache.org/jira/browse/ARROW-39?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232285#comment-15232285 ]

Wes McKinney commented on ARROW-39:
-----------------------------------

This JIRA isn't about IPC; it's about writing in-memory algorithms for 
manipulating tables consisting of multiple chunks (multiple record batches). 
For example: performing an in-memory merge-sort on a table consisting of many 
row batches. The only point here was that, computationally, there's probably 
not much benefit in implementing algorithms for tables with irregular chunk 
sizes (where random access in general requires a binary search) or with chunk 
sizes that are not a power of two (since modulus and division by a 
non-power-of-two cost many more CPU cycles than a bitmask and shift).
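
As a rough sketch of the difference (hypothetical types, not the actual 
arrow::Column API): with irregular chunks, mapping a logical row index to a 
(chunk, offset) pair needs a binary search over the chunk offsets, whereas 
with a fixed power-of-two chunk size it reduces to a shift and a mask:

{code:cpp}
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Irregular chunks: keep a prefix sum of chunk lengths and binary-search it,
// so every random access costs O(log num_chunks).
struct IrregularChunkedIndex {
  std::vector<int64_t> chunk_offsets;  // chunk_offsets[k] = first logical row of chunk k

  std::pair<int64_t, int64_t> Locate(int64_t i) const {
    auto it = std::upper_bound(chunk_offsets.begin(), chunk_offsets.end(), i);
    int64_t chunk = (it - chunk_offsets.begin()) - 1;
    return {chunk, i - chunk_offsets[chunk]};
  }
};

// Fixed power-of-two chunk size: the same lookup is a shift and a mask, i.e.
// O(1) with cheap integer operations.
struct FixedChunkedIndex {
  int log2_chunk_size;  // chunk size == 1 << log2_chunk_size, e.g. 16 for 64K rows

  std::pair<int64_t, int64_t> Locate(int64_t i) const {
    int64_t chunk = i >> log2_chunk_size;
    int64_t offset = i & ((int64_t(1) << log2_chunk_size) - 1);
    return {chunk, offset};
  }
};
{code}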

The context for this JIRA originally was this data structure: 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/column.h#L37

It's fine as a data container, but as soon as you start writing analytics 
(think: pandas, dplyr, and the like) on top of it, these chunked structures 
become a problem.

> C++: Logical chunked arrays / columns: conforming to fixed chunk sizes
> ----------------------------------------------------------------------
>
>                 Key: ARROW-39
>                 URL: https://issues.apache.org/jira/browse/ARROW-39
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>
> Implementing algorithms on large arrays assembled in physical chunks is 
> problematic if:
> - The chunks are not all the same size (except possibly the last chunk, which 
> can be shorter). Otherwise, retrieving a particular element is in general an 
> O(log num_chunks) operation
> - The chunk size is not a power of 2. Computing integer modulus with a 
> non-power of 2 requires more clock cycles (in other words, {{i % p}} is 
> much more expensive to compute than {{i & (p - 1)}}, but the latter only 
> works if p is a power of 2)
> Most of the Arrow data adapters will either feature contiguous data (1 chunk, 
> so chunking is not an issue) or a regular chunk size, so this isn't as much 
> of an immediate concern, but we should consider making it a contract of any 
> data structures dealing in multiple arrays. 
> In general, it would be preferable to reorganize memory into either a regular 
> chunk size (like 64K values per chunk) or a contiguous memory region. I would 
> prefer, for the moment, not to invest significant energy in writing 
> algorithms for data with irregular chunk sizes. 
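
To make the arithmetic point in the bullets quoted above concrete, here is a 
small standalone sketch (illustrative only, not Arrow code) of why 
{{i & (p - 1)}} can replace {{i % p}} only when p is a power of two:

{code:cpp}
#include <cassert>
#include <cstdint>

// When the chunk size p is a power of two, i % p can be computed with a
// single AND against (p - 1); for other values of p the compiler has to
// emit a real integer division, which takes many more clock cycles.
int64_t FastMod(int64_t i, int64_t p) {
  assert(p > 0 && (p & (p - 1)) == 0);  // true only when p is a power of two
  return i & (p - 1);
}

int main() {
  const int64_t p = 1 << 16;  // e.g. 64K values per chunk
  for (int64_t i : {0, 65535, 65536, 1000000}) {
    assert(FastMod(i, p) == i % p);  // identical results since p == 2^16
  }
  return 0;
}
{code}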



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
