zero323 commented on pull request #34951:
URL: https://github.com/apache/spark/pull/34951#issuecomment-997857411
> Wow .. this is a big change. cc @ueshin @BryanCutler @viirya too FYI. I
will review this closely within this week 👍
Just to make the context clear, so this doesn't look flimsy ‒ this alone
is nice to have IMHO, because `pyspark.sql.functions` is the fastest-growing
module and adding new functions should require as little effort as possible.
But there is another opportunity here ‒ with this, and a few minor additional
tweaks, we can cache `_get_jvm_function`
```python
from functools import lru_cache
@lru_cache(192)
def _get_jvm_function(name: str, sc: SparkContext) -> Callable:
...
```
significantly reducing the overhead of invoking the whole thing.
In simple experiments this:
```python
>>> %%timeit -n 100
... for i in range(1000): col(f"a{i}")
...
```
currently takes
```
1.08 s ± 9.44 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
and after caching function retrieval:
```
116 ms ± 3.73 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Maybe not a big deal for batch applications, but if you have a complex query
in `foreachBatch`, it is a nice gain (and it is still around an order of
magnitude difference).
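To illustrate why the gain is so large: with `lru_cache`, repeated lookups for the same function name skip the expensive retrieval entirely. The sketch below is a minimal, self-contained stand-in ‒ `expensive_lookup` and the call counter are hypothetical placeholders for the actual JVM round trip, not PySpark API:

```python
from functools import lru_cache

# Hypothetical stand-in for the JVM lookup; CALLS counts how
# often the slow path actually runs.
CALLS = {"n": 0}

def expensive_lookup(name: str):
    CALLS["n"] += 1
    return lambda *args: (name, args)

@lru_cache(maxsize=192)
def get_function(name: str):
    # Cached: repeated lookups for the same name hit the cache,
    # mirroring the proposed caching of _get_jvm_function.
    return expensive_lookup(name)

for _ in range(1000):
    get_function("col")

print(CALLS["n"])  # the slow path ran only once
```

The same pattern applies to `_get_jvm_function`, provided its arguments (the name and the `SparkContext`) remain hashable.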
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]