Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/6714#issuecomment-113055264
After this patch, it looks like `_next_limit` is called only once per
`ExternalSorter.sorted` call, and `sorted` is its only caller. I wonder
whether we should just remove the method and inline its body at the call
site to make things a bit easier to read.
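Concretely, inlining would amount to something like this (just a sketch of
the single call site; I'm assuming, not quoting, the surrounding code):

```python
# hypothetical inlined form inside ExternalSorter.sorted:
limit = max(self.memory_limit, get_used_memory() * 1.05)
```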
This makes me wonder, though: is it actually safe to no longer raise the
memory limit after spilling, as we did before? Here's the comment in
`_next_limit`:
```python
def _next_limit(self):
"""
Return the next memory limit. If the memory is not released
after spilling, it will dump the data only when the used memory
starts to increase.
"""
return max(self.memory_limit, get_used_memory() * 1.05)
```
If we no longer call `_next_limit()` after spilling, what will happen in
cases where spilling fails to actually free memory? Does this mean that
we'll spill every `batch` items and pay a huge cost during the merge?
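To make the concern concrete, here's a minimal, self-contained sketch of the
kind of spill loop I have in mind. `external_sorted`, `spill_to_disk`, and
the `get_used_memory` stand-in below are illustrative assumptions, not the
real `pyspark.shuffle` code:

```python
import heapq

def get_used_memory():
    # Stand-in for pyspark.shuffle.get_used_memory: peak resident set size,
    # roughly in MB on Linux (the real helper prefers psutil).
    import resource
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss >> 10

def spill_to_disk(chunk):
    # Hypothetical helper: the real code writes a sorted run to a temp file
    # and streams it back; here we just hand the sorted run back.
    return iter(chunk)

def external_sorted(iterator, memory_limit, key=None):
    """Sketch: sort a stream, spilling sorted runs when memory exceeds limit."""
    batch, limit = 100, memory_limit
    chunk, runs = [], []
    for i, item in enumerate(iterator, 1):
        chunk.append(item)
        if i % batch == 0 and get_used_memory() > limit:
            chunk.sort(key=key)
            runs.append(spill_to_disk(chunk))
            chunk = []
            # Pre-patch behavior: bump the limit after a spill so that, if
            # spilling freed no memory, we don't spill again until usage grows:
            # limit = max(limit, get_used_memory() * 1.05)
    chunk.sort(key=key)
    return heapq.merge(*(runs + [iter(chunk)]), key=key)
```

With the commented-out bump restored, a stuck memory reading causes at most
one extra spill until usage actually grows; without it, the check trips again
every `batch` items and the final merge has to combine a huge number of tiny
runs.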