owenowenisme opened a new issue, #49214:
URL: https://github.com/apache/arrow/issues/49214
### Describe the enhancement requested
### Problem
The `is_in` kernel (and other set lookup operations) rebuilds the internal
hash table (MemoTable) from `value_set` on **every invocation**, even when
`value_set` is identical across calls. There is currently no API to pre-build
and reuse the hash table.
This becomes a significant bottleneck when the same `value_set` is used to
filter many arrays (e.g., in a streaming/batch processing pipeline).
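The cost asymmetry is easy to see even outside Arrow: building a hash structure is O(n) in the size of `value_set`, while each probe is O(1) per element. A stdlib-only analogue (a plain Python `set` standing in for Arrow's MemoTable; all names below are illustrative, not Arrow API):

```python
# Stand-in for Arrow's MemoTable: a plain Python set.
# Contrasts "rebuild on every call" with "build once, probe many" --
# the work this issue proposes to amortize.

value_set = list(range(1_000_000))        # analogue of the checkpoint ID set
blocks = [[1, 2, 999_999_999] for _ in range(20)]

# Current behaviour: the lookup structure is rebuilt for every block.
def is_in_rebuild(block, values):
    table = set(values)                   # O(len(values)) on EVERY call
    return [x in table for x in block]

# Proposed behaviour: build once, probe many times.
table = set(value_set)                    # O(len(values)) ONCE
def is_in_prebuilt(block):
    return [x in table for x in block]    # pure probing, no rebuild

for block in blocks:
    assert is_in_rebuild(block, value_set) == is_in_prebuilt(block)
```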
### Use Case
In Ray Data Checkpointing, we want to use `pc.is_in` to perform an
anti-join: filtering out already-processed row IDs from each incoming block of
data. The checkpoint ID set (`value_set`) is the same across all blocks, but
the hash table is rebuilt for every block.
```python
# value_set is the same across all calls — hash table is rebuilt each time
for block in blocks:
    membership = pc.is_in(block["id"], value_set=checkpoint_ids)
    filtered = block.filter(pc.invert(membership))
```
With 100M checkpoint IDs and 20 blocks, the hash table construction alone
takes **~250s** (extrapolating from #36059 benchmarks), doing identical work 20
times.
### Benchmark
Building on the benchmarks from #36059, the cost of rebuilding becomes clear
when multiplied across calls:
```python
import pyarrow as pa
import pyarrow.compute as pc
import numpy as np
value_set = pa.array(np.arange(10_000_000))
blocks = [pa.array(np.random.choice(np.arange(1000), 100)) for _ in range(20)]

# Current: ~25s total (rebuilds hash table 20 times)
for block in blocks:
    pc.is_in(block, value_set)
# Ideal: ~1.3s (build once, probe 20 times)
```
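The "ideal" figure is just the one-time build amortized over the loop. Using the numbers above (~25 s total for 20 rebuilds, so ~1.25 s per build, with an assumed near-negligible probe cost for 100-row blocks), the back-of-envelope arithmetic is:

```python
# Back-of-envelope check of the figures quoted above (assumed split:
# build cost dominates; probing 100-row blocks is effectively free).
n_blocks = 20
total_current = 25.0                   # seconds, hash table rebuilt 20 times
build_once = total_current / n_blocks  # ~1.25 s per hash-table build
probe_total = 0.05                     # assumed total cost of 20 tiny probes

ideal = build_once + probe_total       # build once, probe 20 times: ~1.3 s
```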
### Proposed API
**Option A — Prepared/materialized value set object:**
```python
# Build hash table once
lookup = pc.SetLookup(value_set) # or pc.prepare_set(value_set)
# Reuse across calls — probes only, no rebuild
for block in blocks:
    mask = lookup.is_in(block)
```
**Option B — Caching in SetLookupOptions:**
```python
options = pc.SetLookupOptions(value_set, cache=True)
# First call builds hash table; subsequent calls reuse it
for block in blocks:
    mask = pc.is_in(block, options=options)
```
**Option C — Lower-level hash set exposure:**
```python
hash_set = pc.HashSet(value_set) # builds MemoTable
for block in blocks:
    mask = hash_set.is_in(block)
    # or: mask = pc.is_in(block, options=pc.SetLookupOptions(hash_set))
```
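Whichever option is chosen, the shape of the prepared-object pattern (Options A and C) can be sketched in plain Python, with a `frozenset` standing in for the MemoTable. This is a sketch of the interface shape only, not an implementation of the Arrow kernel:

```python
# Illustrative stdlib mock of the prepared-lookup interface: build the
# hash structure once in __init__, then probe repeatedly via is_in().
# A frozenset stands in for Arrow's MemoTable.
class SetLookup:
    def __init__(self, value_set):
        self._table = frozenset(value_set)   # one-time O(n) build

    def is_in(self, values):
        # Pure probing: O(1) per element, no rebuild across calls.
        return [v in self._table for v in values]

lookup = SetLookup([10, 20, 30])             # build once
assert lookup.is_in([10, 15, 30]) == [True, False, True]
```

The same `lookup` object would then be reused across all blocks in the loop, which is exactly the amortization the three API options above are trying to enable.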
### Impact
This would benefit any workload that probes the same set repeatedly:
- Streaming anti-joins / semi-joins
- Incremental processing (checkpoint filtering)
- Repeated dictionary lookups
- Any ETL pipeline filtering against a static blocklist/allowlist
### Related
- #36059 — Performance of building up HashTable (MemoTable) in `is_in`
kernel (addressed MemoTable speed, but not reuse)
### Component(s)
C++