ueshin opened a new pull request, #52926:
URL: https://github.com/apache/spark/pull/52926
### What changes were proposed in this pull request?
Makes `PySparkLogger` in UDFs store one log entry per log function call.
### Why are the changes needed?
Currently, if `PySparkLogger` is used in UDFs, it produces two entries
per log function call, because it automatically adds a handler that writes
to `sys.stderr`.
The additional handler is unnecessary when running inside the
`capture_outputs` context, which already captures the output.
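The mechanism can be illustrated with plain stdlib `logging`. This is a minimal sketch, not the actual Spark worker code: `CaptureHandler` and `CapturedStderr` below are hypothetical stand-ins for the worker-side capture that turns logger output and `sys.stderr` writes into rows of `system.session.python_worker_logs`.

```python
import io
import logging

captured = []  # stands in for rows of system.session.python_worker_logs

class CaptureHandler(logging.Handler):
    # Stand-in for the structured logging capture on the worker side.
    def emit(self, record):
        captured.append(("structured", record.getMessage()))

class CapturedStderr(io.StringIO):
    # Stand-in for capture_outputs: anything written to stderr also
    # becomes a log row (tagged "stderr").
    def write(self, s):
        if s.strip():
            captured.append(("stderr", s.strip()))
        return len(s)

logger = logging.getLogger("pyspark_logger_sketch")
logger.setLevel(logging.WARNING)
logger.propagate = False
logger.addHandler(CaptureHandler())

# The redundant handler: writes each record to (captured) stderr as well.
extra = logging.StreamHandler(CapturedStderr())
logger.addHandler(extra)

logger.warning("WARN level message")
print(len(captured))  # 2: the structured row plus a duplicate via stderr

# The fix in spirit: skip the stderr handler inside the capture context,
# so each log call yields exactly one entry.
logger.removeHandler(extra)
captured.clear()
logger.warning("WARN level message")
print(len(captured))  # 1
```

With the extra handler attached, every call is recorded twice, matching the duplicated `stderr` row in the "before" table below; without it, only the structured entry remains.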
<details>
<summary>example</summary>
```python
>>> from pyspark.sql.functions import *
>>> from pyspark.logger import PySparkLogger
>>>
>>> @udf
... def pyspark_logger_test_udf(x):
...     logger = PySparkLogger.getLogger("test")
...     logger.warn(f"WARN level message: {x}", x=x)
...     return str(x)
...
>>>
>>> spark.conf.set("spark.sql.pyspark.worker.logging.enabled", True)
>>>
>>> spark.range(1).select(pyspark_logger_test_udf("id")).show()
...
```
</details>
- before
```py
>>> spark.table("system.session.python_worker_logs").orderBy("ts").show(truncate=False)
+--------------------------+-------+----------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+------+
|ts                        |level  |msg                                                                                                                         |context                                       |exception|logger|
+--------------------------+-------+----------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+------+
|2025-11-06 18:40:03.658127|WARNING|WARN level message: 0                                                                                                       |{func_name -> pyspark_logger_test_udf, x -> 0}|NULL     |test  |
|2025-11-06 18:40:03.66424 |ERROR  |{"ts": "2025-11-06 18:40:03.658", "level": "WARNING", "logger": "test", "msg": "WARN level message: 0", "context": {"x": 0}}|{func_name -> pyspark_logger_test_udf}        |NULL     |stderr|
+--------------------------+-------+----------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+------+
```
- after
```py
>>> spark.table("system.session.python_worker_logs").orderBy("ts").show(truncate=False)
+--------------------------+-------+---------------------+----------------------------------------------+---------+------+
|ts                        |level  |msg                  |context                                       |exception|logger|
+--------------------------+-------+---------------------+----------------------------------------------+---------+------+
|2025-11-06 18:41:48.601256|WARNING|WARN level message: 0|{func_name -> pyspark_logger_test_udf, x -> 0}|NULL     |test  |
+--------------------------+-------+---------------------+----------------------------------------------+---------+------+
```
### Does this PR introduce _any_ user-facing change?
Yes, `PySparkLogger` used in UDFs will now store one log entry per log
function call instead of two.
### How was this patch tested?
Added the related tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]