zero323 commented on pull request #34951:
URL: https://github.com/apache/spark/pull/34951#issuecomment-997857411


   > Wow .. this is a big change. cc @ueshin @BryanCutler @viirya too FYI. I 
will review this closely within this week 👍
   
   Just to make the context clear so the motivation doesn't look flimsy ‒ this alone 
is nice to have IMHO, because `pyspark.sql.functions` is the fastest-growing 
module and adding new functions should require as little effort as possible.
   
   But there is another opportunity here ‒ with this, and a few minor additional 
tweaks, we can cache `_get_jvm_function`
   
   ```python
   from functools import lru_cache
   
   
   @lru_cache(192)
   def _get_jvm_function(name: str, sc: SparkContext) -> Callable:
       ...
   ```
    
   significantly reducing the overhead of invoking the whole thing.
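   As a rough, Spark-free illustration of why memoizing the lookup pays off, here is a minimal sketch ‒ the `resolve` helper and the call counter are hypothetical stand-ins for the JVM gateway round-trip, not PySpark internals:
   
   ```python
   from functools import lru_cache
   
   # Hypothetical stand-in for the JVM-side lookup: count how often the
   # "expensive" resolution actually runs.
   lookups = {"count": 0}
   
   @lru_cache(maxsize=192)
   def resolve(name: str):
       lookups["count"] += 1  # simulate the costly gateway round-trip
       return lambda arg: f"{name}({arg})"
   
   # 1000 column expressions, but only a single resolution of "col".
   exprs = [resolve("col")(f"a{i}") for i in range(1000)]
   
   print(lookups["count"])  # -> 1
   print(exprs[0])          # -> col(a0)
   ```
   
   The cached value is the callable itself, so every subsequent invocation skips the lookup entirely and only pays for the call.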
   
   In simple experiments this:
   
   ```python
   >>> %%timeit -n 100
   ... for i in range(1000): col(f"a{i}")
   ...
   ```
   
   currently takes
   
   ```
   1.08 s ± 9.44 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   ```
   
   and after caching function retrieval:
   
   ```
   116 ms ± 3.73 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   ```
   
   
   Maybe not a big deal for batch applications, but if you have a complex query 
in `foreachBatch`, it is a nice gain (and it is still around an order of magnitude 
difference).
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


