nickstanishadb commented on PR #44678:
URL: https://github.com/apache/spark/pull/44678#issuecomment-1887363721
@dtenedor I did some dumb memory profiling by running the following UDTF:
```python
import resource

from pyspark.sql.functions import udtf


@udtf(returnType="step: int, memory: int")
class SimpleUDTF:
    def __init__(self, *args, **kwargs):
        self.step_id = 0

    @staticmethod
    def get_peak_memory_usage_kb() -> int:
        # ru_maxrss is the peak resident set size of this process so far.
        # Linux reports it in kilobytes; macOS reports it in bytes.
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    def eval(self, *args, **kwargs):
        # Emit one row per input row with the peak memory seen so far.
        yield self.step_id, self.get_peak_memory_usage_kb()
        self.step_id += 1

    def terminate(self, *args, **kwargs):
        yield self.step_id, self.get_peak_memory_usage_kb()


# Assumes an active SparkSession named `spark` (e.g. in a notebook).
spark.udtf.register("pyUdtfMemProfile", SimpleUDTF)
```
I'm not entirely confident in the results, because I noticed that running a
high-memory UDTF before the profiling UDTF increased the reported peak memory
usage (I guess they share a process?). But on a fresh 14.2 cluster I ran this
and got a max memory usage across 100 UDTFs of `45440 KB`, which seems
reasonable. Peak memory usage of my vanilla conda distribution on my Mac is
16400 KB, and it makes sense that the UDTF framework would need extra stuff
loaded into memory.
What do you think about designing UDTFs so that we take a safe overestimate
of this (say 50000 KB) as the memory floor, handled entirely by Scala (i.e. the
UDTF framework will always retry if it can't get at least this amount of
memory)? Then, in the user-provided part of the code, users can specify how much
_**additional memory**_ they would like to request just for holding their class
instance attributes (e.g. for AI_FORECAST, just the size of the training data).
If this sounds reasonable to you, we could also add some unit tests to make
sure that the memory requirements of an unmodified Python UDTF don't exceed
50000 KB.
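
A minimal sketch of how the floor-plus-additional-memory idea and the proposed check could look in plain Python. The names `FRAMEWORK_FLOOR_KB`, `total_memory_request_kb`, `peak_memory_kb`, and `exceeds_floor` are hypothetical and not part of any Spark API; the 50000 KB value is only the overestimate suggested above.

```python
import resource
import sys

# Hypothetical constant: the proposed framework memory floor, in KB.
FRAMEWORK_FLOOR_KB = 50_000


def total_memory_request_kb(additional_kb: int = 0,
                            floor_kb: int = FRAMEWORK_FLOOR_KB) -> int:
    """Framework floor plus whatever extra the user asks for."""
    if additional_kb < 0:
        raise ValueError("additional_kb must be non-negative")
    return floor_kb + additional_kb


def peak_memory_kb() -> int:
    """Peak resident set size of the current process, normalized to KB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # macOS reports ru_maxrss in bytes; Linux reports it in kilobytes.
    return peak // 1024 if sys.platform == "darwin" else peak


def exceeds_floor(samples_kb, floor_kb: int = FRAMEWORK_FLOOR_KB) -> bool:
    """True if any observed peak is above the assumed framework floor."""
    return max(samples_kb) > floor_kb
```

A unit test along these lines could collect the `memory` column from the profiling UDTF and assert that `exceeds_floor` is false for it. Because `ru_maxrss` is a per-process high-water mark, such a test would only be meaningful on a fresh worker process.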
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]