owenowenisme opened a new issue, #49214:
URL: https://github.com/apache/arrow/issues/49214

   ### Describe the enhancement requested
   
   ### Problem
   
   The `is_in` kernel (and other set lookup operations) rebuilds the internal 
hash table (MemoTable) from `value_set` on **every invocation**, even when 
`value_set` is identical across calls. There is currently no API to pre-build 
and reuse the hash table.
   
   This becomes a significant bottleneck when the same `value_set` is used to 
filter many arrays (e.g., in a streaming/batch processing pipeline).
   
   ### Use Case
   
   In Ray Data Checkpointing, we want to use `pc.is_in` to perform an 
anti-join: filtering out already-processed row IDs from each incoming block of 
data. The checkpoint ID set (`value_set`) is the same across all blocks, but 
the hash table is rebuilt for every block.
   
   ```python
   # value_set is the same across all calls — hash table is rebuilt each time
   for block in blocks:
       membership = pc.is_in(block["id"], value_set=checkpoint_ids)
       filtered = block.filter(pc.invert(membership))
   ```
   
   With 100M checkpoint IDs and 20 blocks, the hash table construction alone 
takes **~250s** (extrapolating from #36059 benchmarks), doing identical work 20 
times.
   
   ### Benchmark
   
   Building on the benchmarks from #36059, the following shows how the rebuild 
cost multiplies across calls:
   
   ```python
   import pyarrow as pa
   import pyarrow.compute as pc
   import numpy as np
   
   value_set = pa.array(np.arange(10_000_000))
   blocks = [pa.array(np.random.choice(np.arange(1000), 100)) for _ in range(20)]
   
   # Current: ~25s total (rebuilds hash table 20 times)
   for block in blocks:
       pc.is_in(block, value_set)
   
   # Ideal: ~1.3s (build once, probe 20 times)
   ```
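   A minimal timing harness makes the per-call rebuild observable (a sketch; the value-set size is scaled down from the numbers above so it runs in seconds, and absolute timings will vary by machine):
   
   ```python
   import time
   
   import numpy as np
   import pyarrow as pa
   import pyarrow.compute as pc
   
   value_set = pa.array(np.arange(1_000_000))  # scaled-down stand-in for the real set
   blocks = [pa.array(np.random.randint(0, 1000, size=100)) for _ in range(20)]
   
   # Time a single call: dominated by hash-table construction, since each
   # block holds only 100 probe values.
   start = time.perf_counter()
   pc.is_in(blocks[0], value_set=value_set)
   single = time.perf_counter() - start
   
   # Time 20 calls: scales roughly linearly, showing the table is rebuilt
   # on every invocation rather than reused.
   start = time.perf_counter()
   for block in blocks:
       pc.is_in(block, value_set=value_set)
   total = time.perf_counter() - start
   
   print(f"1 call: {single:.3f}s, 20 calls: {total:.3f}s")
   ```
   
   If the hash table were cached, the 20-call total would approach the single-call time plus 19 cheap probes; instead it is close to 20x the single-call time.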
   
   ### Proposed API
   
   **Option A — Prepared/materialized value set object:**
   
   ```python
   # Build hash table once
   lookup = pc.SetLookup(value_set)  # or pc.prepare_set(value_set)
   
   # Reuse across calls — probes only, no rebuild
   for block in blocks:
       mask = lookup.is_in(block)
   ```
   
   **Option B — Caching in SetLookupOptions:**
   
   ```python
   options = pc.SetLookupOptions(value_set, cache=True)
   
   # First call builds hash table; subsequent calls reuse it
   for block in blocks:
       mask = pc.is_in(block, options=options)
   ```
   
   **Option C — Lower-level hash set exposure:**
   
   ```python
   hash_set = pc.HashSet(value_set)  # builds MemoTable
   
   for block in blocks:
       mask = hash_set.is_in(block)
       # or: mask = pc.is_in(block, options=pc.SetLookupOptions(hash_set))
   ```
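   Until an API along these lines exists, a build-once/probe-many pattern can be approximated outside the kernel, e.g. by sorting the value set once and probing with binary search. This is only an illustrative interim sketch, not equivalent to the proposed API: it covers integer IDs, ignores nulls, and pays O(log n) per probe instead of O(1):
   
   ```python
   import numpy as np
   import pyarrow as pa
   
   
   def is_in_sorted(arr: pa.Array, sorted_vals: np.ndarray) -> pa.Array:
       """Probe-only membership test against a pre-sorted value set."""
       vals = arr.to_numpy(zero_copy_only=False)
       idx = np.searchsorted(sorted_vals, vals)
       # Clip so out-of-range probes compare against a real element (and fail).
       idx = np.clip(idx, 0, len(sorted_vals) - 1)
       return pa.array(sorted_vals[idx] == vals)
   
   
   # "Build" once: a single sort, analogous to constructing the hash table.
   checkpoint_ids = np.sort(np.arange(10_000_000))
   
   # Reuse across blocks: probes only, no rebuild.
   block = pa.array([0, 5, 123_456_789])
   print(is_in_sorted(block, checkpoint_ids).to_pylist())  # [True, True, False]
   ```
   
   That this workaround has to leave Arrow's compute layer entirely is itself an argument for exposing the prepared lookup natively.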
   
   ### Impact
   
   This would benefit any workload that probes the same set repeatedly:
   
   - Streaming anti-joins / semi-joins
   - Incremental processing (checkpoint filtering)
   - Repeated dictionary lookups
   - Any ETL pipeline filtering against a static blocklist/allowlist
   
   ### Related
   
   - #36059 — Performance of building up HashTable (MemoTable) in `is_in` 
kernel (addressed MemoTable speed, but not reuse)
   
   
   ### Component(s)
   
   C++

