[jira] [Resolved] (SPARK-54384) Modernize the _batched method for BatchedSerializer

Hyukjin Kwon (Jira) Mon, 17 Nov 2025 13:49:11 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-54384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hyukjin Kwon resolved SPARK-54384.
----------------------------------
    Fix Version/s: 4.2.0
       Resolution: Fixed

Issue resolved by pull request 53086
[https://github.com/apache/spark/pull/53086]

> Modernize the _batched method for BatchedSerializer 
> ----------------------------------------------------
>
>                 Key: SPARK-54384
>                 URL: https://issues.apache.org/jira/browse/SPARK-54384
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 4.1.0
>            Reporter: Tian Gao
>            Assignee: Tian Gao
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 4.2.0
>
>
> We have `itertools` utilities which could make the iterator operations much 
> faster and less verbose.
> {code:python}
> import itertools
> import time
>
>
> def batch_original(iterator, batch_size):
>     # Original approach: accumulate items one by one with a manual counter.
>     items = []
>     count = 0
>     for item in iterator:
>         items.append(item)
>         count += 1
>         if count == batch_size:
>             yield items
>             items = []
>             count = 0
>     if items:
>         yield items
>
>
> def batch_list(iterator, batch_size):
>     # Slicing path: only valid for sized, sliceable inputs such as lists.
>     n = len(iterator)
>     for i in range(0, n, batch_size):
>         yield iterator[i : i + batch_size]
>
>
> def batch_after(iterator, batch_size):
>     # Proposed approach: let itertools.islice pull each batch.
>     it = iter(iterator)
>     while batch := list(itertools.islice(it, batch_size)):
>         yield batch
>
>
> def do_test(iterator, batch):
>     # Time one batching function and return its batches for comparison.
>     result = []
>     start = time.perf_counter_ns()
>     for b in batch(iterator, 10000):
>         result.append(b)
>     end = time.perf_counter_ns()
>     print(f"Batching {batch.__name__} took {(end - start)/1e9:.4f} seconds")
>     return result
>
>
> if __name__ == "__main__":
>     data = range(10000005)
>     result_original = do_test(data, batch_original)
>     result_after = do_test(data, batch_after)
>     assert result_original == result_after
>     data = list(range(10000005))
>     result_list = do_test(data, batch_list)
>     result_after = do_test(data, batch_after)
>     assert result_list == result_after
> {code}
> Note that {{__getslice__}} was *removed* in Python 3.0, so the existing
> optimization for known-size iterators such as lists never takes effect.
> There is no longer a simple way to check whether an iterator supports
> slicing. The most straightforward option is to try it, e.g.
> {{iterator[:1]}}. I don't know how often we deal with lists; if the input
> is frequently a list, we can add that check. The raw {{[:]}} operation is
> 22% faster than the {{islice}}-based implementation.
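
The "try the slice" idea described above could look roughly like the sketch below. It is only an illustration, not the code merged in the pull request: the name `batched` is hypothetical, and the probe is a heuristic that assumes any input accepting `[:1]` and `len()` can be sliced batch by batch.

```python
import itertools


def batched(iterator, batch_size):
    """Yield lists of up to batch_size items (hypothetical helper)."""
    try:
        # Probe for slice support: lists, tuples and ranges succeed,
        # generators raise TypeError, mappings may raise KeyError.
        iterator[:1]
        n = len(iterator)
    except (TypeError, KeyError):
        # Fallback: consume lazily with itertools.islice.
        it = iter(iterator)
        while batch := list(itertools.islice(it, batch_size)):
            yield batch
    else:
        # Fast path: slice the sized input directly.
        for i in range(0, n, batch_size):
            # list() keeps the batch type uniform for tuples and ranges.
            yield list(iterator[i : i + batch_size])
```

For example, `list(batched([1, 2, 3, 4, 5], 2))` gives `[[1, 2], [3, 4], [5]]`, and a generator input produces the same batches through the fallback path. Whether the extra probe pays off depends on how often the input really is a list, which the description leaves open.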



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

