nickstanishadb commented on PR #44678:
URL: https://github.com/apache/spark/pull/44678#issuecomment-1887363721

   @dtenedor I did some dumb memory profiling by running the following UDTF:
   
   ```python
   from pyspark.sql.functions import udtf
   import resource
   
   @udtf(returnType="step: int, memory: int")
   class SimpleUDTF:
       def __init__(self, *args, **kwargs):
           self.step_id = 0
       
       @staticmethod
       def get_peak_memory_usage_kb() -> int:
           # ru_maxrss is the process-wide peak RSS: kilobytes on Linux, bytes on macOS.
           return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
   
       def eval(self, *args, **kwargs):
           yield self.step_id, self.get_peak_memory_usage_kb()
           self.step_id += 1
   
       def terminate(self, *args, **kwargs):
           yield self.step_id, self.get_peak_memory_usage_kb()
   
   spark.udtf.register("pyUdtfMemProfile", SimpleUDTF)
   ```
   
   I'm not entirely confident in the results because I noticed that running a 
high-memory UDTF before the profiling UDTF increased the reported peak memory 
usage (I guess they share a process? `ru_maxrss` is a high-water mark for the 
whole process, so it never goes back down). But on a fresh 14.2 cluster I ran 
this and got a max memory usage across 100 UDTFs of `45440 KB`, which seems 
reasonable. Peak memory usage of my vanilla conda distribution on my Mac is 
16400 KB, and it makes sense that the UDTF framework would need extra stuff 
loaded into memory.
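
   One caveat when comparing the cluster number with the Mac number: 
`ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS. A minimal 
sketch of a helper that normalizes both to KB (the name `peak_memory_kb` is 
made up for illustration, it is not part of PySpark):

```python
import resource
import sys


def peak_memory_kb() -> int:
    """Peak resident set size of the current process, normalized to KB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in kilobytes; macOS reports it in bytes.
    if sys.platform == "darwin":
        return peak // 1024
    return peak
```

Since this is still a per-process high-water mark, it only gives a clean 
baseline when the Python worker has not run anything heavier beforehand.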
   
   What do you think about designing UDTFs so that we take a safe overestimate 
of this (say 50000 KB) as the memory floor handled entirely by Scala (i.e. the 
UDTF framework will always retry if it can't get at least this amount of 
memory)? Then in the user-provided part of the code, users can specify how much 
_**additional memory**_ they would like to request just for holding their class 
instance attributes (e.g. for AI_FORECAST, just the size of the training data).
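
   To make the split concrete, here is a rough sketch of the arithmetic, not a 
proposal for the actual API. `FRAMEWORK_FLOOR_KB`, `estimate_additional_kb` and 
`total_request_kb` are hypothetical names, and `sys.getsizeof` is shallow, so 
it undercounts nested containers:

```python
import sys

# Hypothetical constant: the overestimate suggested in this comment.
FRAMEWORK_FLOOR_KB = 50_000


def estimate_additional_kb(instance) -> int:
    """Shallow estimate (in KB, rounded up) of an object's instance attributes."""
    total_bytes = sum(sys.getsizeof(value) for value in vars(instance).values())
    return -(-total_bytes // 1024)  # ceiling division


def total_request_kb(additional_kb: int) -> int:
    """Framework floor plus whatever the user asks for on top of it."""
    if additional_kb < 0:
        raise ValueError("additional_kb must be non-negative")
    return FRAMEWORK_FLOOR_KB + additional_kb
```

With this shape the user only ever reasons about their own state (e.g. the 
training data), and the floor stays an implementation detail of the framework.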
   
   If this sounds reasonable to you, we could also add some unit tests making 
sure that the memory requirements of an unmodified Python UDTF don't exceed 
50000 KB.
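
   The check in such a test could look roughly like this. It is only a sketch: 
`check_baseline_memory` is a made-up helper, and it assumes the rows come from 
running the profiling UDTF above, i.e. `(step, memory)` tuples with memory in 
KB:

```python
# Hypothetical budget: the memory floor suggested in this comment.
FLOOR_KB = 50_000


def check_baseline_memory(rows, floor_kb: int = FLOOR_KB) -> int:
    """Return the peak memory in `rows`, failing if it exceeds the floor.

    `rows` is an iterable of (step, memory_kb) tuples, matching the
    "step: int, memory: int" schema of the profiling UDTF.
    """
    peak = max(memory for _, memory in rows)
    if peak > floor_kb:
        raise AssertionError(
            f"baseline UDTF peak memory {peak} KB exceeds floor {floor_kb} KB"
        )
    return peak
```

In a real test the rows would be collected from a Spark query against the 
registered UDTF rather than passed in by hand.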


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

