[
https://issues.apache.org/jira/browse/SPARK-54384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-54384.
----------------------------------
Fix Version/s: 4.2.0
Resolution: Fixed
Issue resolved by pull request 53086
[https://github.com/apache/spark/pull/53086]
> Modernize the _batched method for BatchedSerializer
> ----------------------------------------------------
>
> Key: SPARK-54384
> URL: https://issues.apache.org/jira/browse/SPARK-54384
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 4.1.0
> Reporter: Tian Gao
> Assignee: Tian Gao
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.2.0
>
>
> We have `itertools` utilities which could make the iterator operations much
> faster and less verbose.
> {code:java}
> import itertools
> import time
>
> def batch_original(iterator, batch_size):
>     items = []
>     count = 0
>     for item in iterator:
>         items.append(item)
>         count += 1
>         if count == batch_size:
>             yield items
>             items = []
>             count = 0
>     if items:
>         yield items
>
> def batch_list(iterator, batch_size):
>     n = len(iterator)
>     for i in range(0, n, batch_size):
>         yield iterator[i : i + batch_size]
>
> def batch_after(iterator, batch_size):
>     it = iter(iterator)
>     while batch := list(itertools.islice(it, batch_size)):
>         yield batch
>
> def do_test(iterator, batch):
>     result = []
>     start = time.perf_counter_ns()
>     for b in batch(iterator, 10000):
>         result.append(b)
>     end = time.perf_counter_ns()
>     print(f"Batching {batch.__name__} took {(end - start)/1e9:.4f} seconds")
>     return result
>
> if __name__ == "__main__":
>     data = range(10000005)
>     result_original = do_test(data, batch_original)
>     result_after = do_test(data, batch_after)
>     assert result_original == result_after
>     data = list(range(10000005))
>     result_list = do_test(data, batch_list)
>     result_after = do_test(data, batch_after)
>     assert result_list == result_after
> {code}
> Note that {{__getslice__}} was *removed* in Python 3.0, so the existing
> optimization for known-size iterators such as lists never takes effect.
> There is no longer a simple way to tell whether an iterator supports slicing.
> The most straightforward check is to try it, e.g. {{iterator[:1]}}. I don't
> know how often we deal with lists; if the input is often a list, we can add
> that check. The raw {{[:]}} operation is 22% faster than the
> {{islice}}-based implementation ({{batch_after}}).
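
The "try it out" idea from the description could look roughly like the sketch below. This is only an illustration, not the code merged in the pull request: the name batched_with_slice_probe is hypothetical, and catching KeyError alongside TypeError is an assumption made so that mappings (where a slice may be treated as a missing key) also fall back to the generic path.

```python
import itertools


def batched_with_slice_probe(iterator, batch_size):
    """Yield lists of up to batch_size items (hypothetical helper)."""
    try:
        # Probe for slice support; sequences such as list or range accept this.
        iterator[:1]
    except (TypeError, KeyError):
        # Generic iterables (generators, sets, ...): pull items with islice.
        it = iter(iterator)
        while batch := list(itertools.islice(it, batch_size)):
            yield batch
    else:
        # Sliceable input: cut fixed-size windows directly.
        for i in range(0, len(iterator), batch_size):
            yield list(iterator[i : i + batch_size])


# __getslice__ no longer exists on Python 3 lists, so it cannot be used as the check.
assert not hasattr([], "__getslice__")
```

The probe costs one extra slice per call, which only pays off if list inputs are common; that frequency is the open question raised in the description.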