[
https://issues.apache.org/jira/browse/ARROW-39?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231675#comment-15231675
]
Jacques Nadeau commented on ARROW-39:
-------------------------------------
Can you expound here? I'm not sure what you mean by "chunk". If you're speaking
about batches of records, I don't think fixed record batch sizes should be a
requirement.
> C++: Logical chunked arrays / columns: conforming to fixed chunk sizes
> ----------------------------------------------------------------------
>
> Key: ARROW-39
> URL: https://issues.apache.org/jira/browse/ARROW-39
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
>
> Implementing algorithms on large arrays assembled in physical chunks is
> problematic if:
> - The chunks are not all the same size (except possibly the last chunk, which
> can be smaller). With irregular sizes, retrieving a particular element is
> in general an O(log num_chunks) operation
> - The chunk size is not a power of 2. Computing integer modulus with a
> divisor that is not a power of 2 requires more clock cycles (in other words,
> {{i % p}} is much more expensive to compute than {{i & (p - 1)}}, but the
> latter only works if p is a power of 2)
> Most of the Arrow data adapters will either feature contiguous data (1 chunk,
> so chunking is not an issue) or a regular chunk size, so this isn't as much
> of an immediate concern, but we should consider making it a contract of any
> data structures dealing in multiple arrays.
> In general, it would be preferable to reorganize memory into either a regular
> chunksize (like 64K values per chunk) or a contiguous memory region. I would
> prefer for the moment not to invest significant energy in writing
> algorithms for data with irregular chunk sizes.
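
For illustration, the two lookup costs described in the issue can be sketched
as follows. This is a minimal sketch, not Arrow code: ChunkLocation,
LocateRegular, and LocateIrregular are hypothetical names, and the cumulative
offsets vector is an assumed representation of irregular chunk boundaries.
Logical indices are assumed to be non-negative and in range.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper type for illustration; not part of the Arrow C++ API.
struct ChunkLocation {
  int64_t chunk_index;     // which chunk holds the element
  int64_t index_in_chunk;  // position of the element inside that chunk
};

// Regular chunks whose size is a power of 2: O(1) using a shift and a mask.
// log2_chunk_size is the base-2 logarithm of the chunk size
// (for example, 16 for 64K values per chunk).
inline ChunkLocation LocateRegular(int64_t i, int log2_chunk_size) {
  const int64_t mask = (int64_t{1} << log2_chunk_size) - 1;
  // i >> k replaces i / p, and i & (p - 1) replaces i % p, where p == 1 << k.
  return {i >> log2_chunk_size, i & mask};
}

// Irregular chunks: binary search over cumulative offsets, O(log num_chunks).
// offsets has num_chunks + 1 entries: offsets[k] is the logical index of the
// first element of chunk k, and offsets.back() is the total length.
inline ChunkLocation LocateIrregular(int64_t i,
                                     const std::vector<int64_t>& offsets) {
  // Find the first offset strictly greater than i; the chunk just before it
  // is the one that contains logical index i.
  auto it = std::upper_bound(offsets.begin(), offsets.end(), i);
  const int64_t chunk = static_cast<int64_t>(it - offsets.begin()) - 1;
  return {chunk, i - offsets[chunk]};
}
```

With chunks of 64K values, LocateRegular(65536 + 5, 16) resolves to chunk 1,
position 5 with no search. With chunks of lengths 3, 7, and 2 (offsets
{0, 3, 10, 12}), LocateIrregular(9, offsets) has to search the offsets to
resolve to chunk 1, position 6.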