geruh commented on issue #1259: URL: https://github.com/apache/iceberg-python/issues/1259#issuecomment-3060166331
They added support for the `arange` function in pyarrow in https://github.com/apache/arrow/pull/46778/files. I pulled the latest Arrow branch, built it locally, and integrated it into PyIceberg. The updated implementation would look like this:

```python
import itertools
from typing import List

import pyarrow as pa
import pyarrow.compute as pc


def _combine_positional_deletes(positional_deletes: List[pa.ChunkedArray], start_index: int, end_index: int) -> pa.Array:
    if len(positional_deletes) == 1:
        all_chunks = positional_deletes[0]
    else:
        all_chunks = pa.chunked_array(itertools.chain(*[arr.chunks for arr in positional_deletes]))
    # Every row position in [start_index, end_index)
    full_range = pa.arange(start_index, end_index)
    # Drop the deleted positions, then rebase the survivors so they start at 0
    result = pc.filter(full_range, pc.invert(pc.is_in(full_range, value_set=all_chunks)))
    return pc.subtract(result, pa.scalar(start_index))
```

Using the benchmark from @corleyma, I get roughly:

```
Testing range from 0 to 10000000 (10000000 elements)
Python average: 0.6944s
Cython average: 0.0096s
Speedup: 71.96x
```

Maybe we can add this once the next Arrow release is out, if the upgrade isn't too burdensome.
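For anyone skimming the thread, a plain-Python sketch of what `_combine_positional_deletes` is meant to return may help: the positions in `[start_index, end_index)` that are not deleted, rebased to start at zero. This is not PyIceberg code. `surviving_positions` is a hypothetical name, and the sketch uses only the standard library, so it is useful as a reference for checking results rather than as a replacement.

```python
from typing import Iterable, List


def surviving_positions(deletes: Iterable[int], start_index: int, end_index: int) -> List[int]:
    """Reference semantics only: hypothetical helper, not part of PyIceberg."""
    deleted = set(deletes)
    # Keep every position in [start_index, end_index) that is not deleted,
    # then shift it so the result is relative to start_index.
    return [pos - start_index for pos in range(start_index, end_index) if pos not in deleted]


# Positions 11 and 13 are deleted; 99 is outside the range and has no effect.
print(surviving_positions([11, 13, 99], start_index=10, end_index=15))  # prints [0, 2, 4]
```

A small reference like this could serve as an oracle in a unit test for the Arrow-based version, assuming both are fed the same delete positions.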
