geruh commented on issue #1271:
URL: 
https://github.com/apache/iceberg-python/issues/1271#issuecomment-3060167245

   They added support for the `arange` function in pyarrow in 
https://github.com/apache/arrow/pull/46778/files. I pulled the latest Arrow 
branch, built it locally, and integrated it into PyIceberg. The updated 
implementation would look like this:
   
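   ```python
   import itertools
   from typing import List

   import pyarrow as pa
   import pyarrow.compute as pc


   def _combine_positional_deletes(
       positional_deletes: List[pa.ChunkedArray], start_index: int, end_index: int
   ) -> pa.Array:
       # Merge all positional-delete arrays into a single chunked array.
       if len(positional_deletes) == 1:
           all_chunks = positional_deletes[0]
       else:
           all_chunks = pa.chunked_array(itertools.chain(*[arr.chunks for arr in positional_deletes]))

       # Every row position in [start_index, end_index); pa.arange comes from the Arrow PR above.
       full_range = pa.arange(start_index, end_index)

       # Keep only the positions that are not marked as deleted.
       result = pc.filter(full_range, pc.invert(pc.is_in(full_range, value_set=all_chunks)))

       # Shift the surviving positions so they are relative to start_index.
       return pc.subtract(result, pa.scalar(start_index))
   ```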
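
To make the intended result concrete, here is a pure-Python reference of what the function above computes. It is only an illustrative sketch: `kept_row_offsets` is a hypothetical name that does not exist in PyIceberg, and it takes a flat iterable of delete positions instead of a list of `pa.ChunkedArray`.

```python
from typing import Iterable, List


def kept_row_offsets(deleted_positions: Iterable[int], start_index: int, end_index: int) -> List[int]:
    """Offsets, relative to start_index, of rows in [start_index, end_index) that are not deleted."""
    deleted = set(deleted_positions)
    return [pos - start_index for pos in range(start_index, end_index) if pos not in deleted]


# Rows 10..14 with positions 11 and 13 deleted; position 42 lies outside the range and is ignored.
print(kept_row_offsets([11, 13, 42], start_index=10, end_index=15))  # [0, 2, 4]
```

The Arrow version should return the same offsets as an array, with the range generation and the membership filter running in native code.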
   
   Using the benchmark from @corleyma, I get roughly:
   
   ```
   Testing range from 0 to 10000000 (10000000 elements)
   Python average: 0.6944s
   Cython average: 0.0096s
   Speedup: 71.96x
   ```
   
   Maybe we can add this in the next Arrow release, if an upgrade isn't too 
burdensome.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

