[
https://issues.apache.org/jira/browse/ARROW-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eduardo Ponce updated ARROW-7394:
---------------------------------
Fix Version/s: 8.0.0
(was: 7.0.0)
> [C++][DataFrame] Implement zero-copy optimizations when performing Filter
> -------------------------------------------------------------------------
>
> Key: ARROW-7394
> URL: https://issues.apache.org/jira/browse/ARROW-7394
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Assignee: Eduardo Ponce
> Priority: Major
> Labels: dataframe
> Fix For: 8.0.0
>
>
> For high-selectivity filters (most elements included), it may be wasteful and
> slow to copy large contiguous ranges of array chunks into the resulting
> ChunkedArray. Instead, we can scan the boolean filter array and slice those
> ranges off the source data, which avoids the copy.
> We will need to determine empirically how long a contiguous run of selected
> elements must be for the slice-based approach to beat simple copy-based
> materialization. For example, in a filter array like
> 1 0 1 0 1 0 1 0 1
> it would not make sense to slice 5 times because slicing carries some
> overhead. But if we had
> 1 ... 1 [100 1's] 0 1 ... 1 [100 1's] 0 1 ... 1 [100 1's] 0 1 ... 1 [100 1's]
> then performing 4 slices may be faster than doing a copy materialization.
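
The run-scanning idea described in the issue can be sketched roughly as follows. This is a standalone illustration only, not the Arrow implementation: the `Run` struct, the `PlanFilter` function, and the `min_slice_length` parameter are hypothetical names, the mask is a plain `std::vector<bool>` rather than an Arrow BooleanArray, and the threshold is an input precisely because the issue says it still has to be measured.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical type: one contiguous run of selected (true) mask elements.
struct Run {
  std::size_t offset;  // index of the first selected element in the run
  std::size_t length;  // number of consecutive selected elements
  bool slice;          // true: emit a zero-copy slice; false: copy the values
};

// Scan the filter mask once and decide, per run of consecutive true values,
// whether the run is long enough to merit slicing instead of copying.
// `min_slice_length` is an assumed tuning knob, not a known Arrow constant.
std::vector<Run> PlanFilter(const std::vector<bool>& mask,
                            std::size_t min_slice_length) {
  std::vector<Run> runs;
  const std::size_t n = mask.size();
  std::size_t i = 0;
  while (i < n) {
    if (!mask[i]) {  // skip elements that are filtered out
      ++i;
      continue;
    }
    const std::size_t start = i;
    while (i < n && mask[i]) ++i;  // extend the run of selected elements
    const std::size_t length = i - start;
    runs.push_back({start, length, length >= min_slice_length});
  }
  return runs;
}
```

With an assumed threshold of 64, the alternating mask from the description yields five runs that are all marked for copying, while the second mask yields four runs of length 100 that are all marked for slicing. In Arrow C++, a run marked for slicing would map onto something like `Array::Slice(offset, length)`, which shares the underlying buffers instead of copying them. The runs marked for copying would still need to be gathered into a newly allocated chunk.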