Jeadie opened a new pull request, #10236:
URL: https://github.com/apache/arrow-rs/pull/10236
# Which issue does this PR close?
<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases. You can
link an issue to this PR using the GitHub syntax.
-->
- No issue yet. Happy to open if changes wanted.
# Rationale for this change
<!--
Why are you proposing this change? If this is already explained clearly in
the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your
changes and offer better suggestions for fixes.
-->
This PR improves the performance of `FilterPredicate::filter` for array
based data types, specifically: `List<T>`, `FixedSizeList<T>`, `Map<T>`.
This optimisation is based on one idea: translate retained (i.e. filter =
true) parent-row runs into child element ranges (trivially contiguous due to
how list/fixed/map layouts work), then hand those ranges to a already-fast
child kernels rather than copying element-by-element.
`filter` is one of the most-executed kernels in Apache DataFusion, and now
these list/nested types have fast path. Several common Datafusion uses are
especially impacted:
- `FixedSizeList` as embeddings or vectors
- JSON or nested data (`array_agg`, `Unnest`)
- `GROUP BY` operations
- Hash/sort-merge joins filter probe/build columns on list types
# What changes are included in this PR?
<!--
There is no need to duplicate the description in the issue here but it is
sometimes worth providing a summary of the individual changes in this PR.
-->
1. Specialisation within `FilterPredicate::filter` for
`DataType::FixedSizeList`, `DataType::Map`, `DataType::List` and
`DataType::LargeList` (the latter two are only specialised for certain/most
child types).
2. Associated benchmarks in `arrow/benches/filter_kernels.rs`.
## Changes Explained
### Before
Prior to this PR, `List<T>`, used the `MutableArrayData` fallback.
Example
```
filter: [ T F T T F ]
parent rows: [row0|row1|row2|row3|row4]
child values: [a b c|d e|f g h|i j k|l m n o]
```
MutableArrayData walks the full child buffer, copying by range for each
retained row.
### After
```
filter: [ T F T T F ]
row0 row1 row2 row3 row4
offsets: [ 0 3 5 8 11 15 ]
predicate_row_ranges → [(0,1), (2,4)] ← runs of kept rows
child_ranges:
run (0,1): offsets[0]..offsets[1] = [0, 3) ← 1 parent element
run (2,4): offsets[2]..offsets[4] = [5, 11) ← 2 parent elements. rows 2+3
merged into ONE range
child values: [a b c|d e|f g h|i j k|l m n o]
╰─────╯ ╰───────────╯
[0, 3) [5, 11)
Rebuild new offsets from retained row lengths:
row0: 3-0=3 → new_offsets: [0, 3]
row2: 8-5=3 → new_offsets: [0, 3, 6]
row3: 11-8=3 → new_offsets: [0, 3, 6, 9]
output: List([ [a,b,c], [f,g,h], [i,j,k] ])
```
### Non-specialised Child types.
`List<T>` is not specialised for some `T` child types (and similarly other
array types mentioned). This PR specialises if the child type `T` has a fast,
vectorized kernel for it that is driven only by the predicate's `Slices` (never
reads `filter` directly). Everything else uses the well-tuned, correct
`MutableArrayData` fallback.
| Child type | Why it stays on fallback |
|---|---|
| dense `Union` | Consecutive rows carry different type-ids and
non-contiguous child offsets, no contiguous ranges to copy |
| `RunEndEncoded` | Its kernel (`filter_run_end_array`) reads
`predicate.filter` **directly**. The specialization streams ranges via a
`Slices` predicate whose `filter` is intentionally empty. |
| **unmeasured exotics** | Several exotics not benchmarked stay on fallback
by default. |
Every other list child *is* specialized: primitives, boolean, null,
`Utf8`/`LargeUtf8`/`Binary`/`LargeBinary`, `Utf8View`/`BinaryView`,
`FixedSizeBinary`, `FixedSizeList`, `Dictionary`, `Struct`, sparse `Union`,
`ListView`/`LargeListView`, and nested `List`/`LargeList`.
# Are these changes tested?
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code
If tests are not included in your PR, please explain why (for example, are
they covered by existing tests)?
If this PR claims a performance improvement, please include evidence such as
benchmark results.
-->
Tests in `arrow-select/src/filter.rs`.
## Benchmark results
```shell
>> cargo bench -p arrow --bench filter_kernels \
--features test_utils \
--baseline after \
-- "filter (list|fixedsizelist|map)"
```
- `size = 65536`
- Cells: `before → after (speedup)`, where
- before = `MutableArrayData` fallback
- after = specialized.
- ⚠ marks a regression (sub-1.0).
### `List<T>` by child type
| Child | kept ½ | kept 1023/1024 | kept 1/1024 |
|---|---|---|---|
| Int32 | 433→373 µs (1.16×) | 128→93 µs (1.38×) | 2.92→1.12 µs (2.60×) |
| Utf8 | 627→588 µs (1.07×) | 455→471 µs (0.97× ⚠) | 3.57→1.44 µs (2.48×) |
| LargeUtf8 | 679→593 µs (1.14×) | 644→474 µs (1.36×) | 3.58→1.44 µs (2.49×)
|
| Binary | 623→581 µs (1.07×) | 451→469 µs (0.96× ⚠) | 3.49→1.44 µs (2.43×) |
| LargeBinary | 656→590 µs (1.11×) | 628→470 µs (1.34×) | 3.58→1.44 µs
(2.48×) |
| Utf8View | 598→315 µs (1.90×) | 566→183 µs (3.10×) | 3.16→0.94 µs (3.35×) |
| FixedSizeBinary | 464→348 µs (1.33×) | 213→118 µs (1.80×) | 2.87→1.03 µs
(2.79×) |
| FixedSizeList | 540→446 µs (1.21×) | 216→134 µs (1.61×) | 3.61→1.42 µs
(2.54×) |
| Dictionary | 482→370 µs (1.30×) | 126→94 µs (1.34×) | 3.36→1.37 µs (2.45×)
|
| Struct | 519→371 µs (1.40×) | 128→93 µs (1.37×) | 3.69→1.12 µs (3.28×) |
| Map | 1107→998 µs (1.11×) | 780→648 µs (1.20×) | 6.32→2.89 µs (2.19×) |
| Union (sparse) | 832→663 µs (1.26×) | 324→168 µs (1.93×) | 5.41→2.30 µs
(2.35×) |
| ListView | 1702→548 µs (3.11×) | 2687→128 µs (20.9×) | 5.85→1.41 µs
(4.13×) |
| nested `List<List>` | 612→620 µs (0.99× ⚠) | 320→295 µs (1.09×) |
3.93→1.73 µs (2.27×) |
### Direct kernels (new)
| Kernel | kept ½ | kept 1023/1024 | kept 1/1024 |
|---|---|---|---|
| `filter FixedSizeList` | 309→197 µs (1.57×) | 20.8→21.6 µs (0.96× ⚠) |
3.27→0.69 µs (4.76×) |
| `filter Map` | 791→648 µs (1.22×) | 286→291 µs (0.98×) | 5.01→1.88 µs
(2.66×) |
### Value-length sweep — `List<Utf8>` @ kept ½
| Value length | before → after | speedup |
|---|---|---|
| 8 B | 669→591 µs | 1.13× |
| 64 B | 1241→1074 µs | 1.16× |
| 256 B | 4796→3167 µs | 1.51× |
| 10× rows (short) | 6598→6446 µs | 1.02× |
### Regressions / caveats
All sub-1.0 results occur only at the dense `kept 1023/1024` end (rare for
selective predicates), plus the nested-list `½` tie:
- **dense:** Utf8 0.97×, Binary 0.96×, `filter FixedSizeList` 0.96×
(memcpy-bound — the fallback is already tight there).
- **nested `List<List>` @ ½: 0.99×** (offset-dominated; ties the fallback,
then wins 1.09×/2.27× at the other selectivities).
No remaining regression exceeds ~4%. Every `kept 1/1024` (highly selective)
case is a 2.2–4.8× win.
# Are there any user-facing changes?
<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
If there are any breaking changes to public APIs, please call them out.
-->
N/A.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]