Tian Gao created SPARK-54285:
--------------------------------
Summary: Timestamp conversion is taking too long in Python UDF
Key: SPARK-54285
URL: https://issues.apache.org/jira/browse/SPARK-54285
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 4.1.0
Reporter: Tian Gao
Currently, timestamp conversion takes about 500us per call in the worker process,
versus roughly 200ns in the daemon or other normal Python processes. This seems to
be a known glibc issue: when a forked process calls `localtime()`, the timezone
cache is somehow locked, so every call goes through the slow path.
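A minimal standalone sketch, outside Spark, that may surface the effect on a POSIX
system with glibc (the iteration count and labels are illustrative assumptions):

```python
import os
import time
from datetime import datetime

def bench(label: str, n: int = 100_000) -> None:
    ts = time.time()
    start = time.perf_counter()
    for _ in range(n):
        datetime.fromtimestamp(ts)  # goes through glibc localtime() internally
    print(f"{label}: {(time.perf_counter() - start) / n * 1e9:.0f} ns/call")

bench("parent process")
pid = os.fork()
if pid == 0:
    # Forked child, analogous to a PySpark worker forked from the daemon.
    bench("forked child")
    os._exit(0)
os.waitpid(pid, 0)
```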
This can be reproduced with a timestamp column, e.g. `F.to_timestamp("event_time")`,
consumed by a Python UDF, as in the sketch below.
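A hedged repro sketch; the sample data, row count, and UDF body are illustrative
assumptions rather than the original reproducer:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

df = spark.range(100_000).withColumn(
    "event_time", F.to_timestamp(F.lit("2025-11-10 11:17:07"))
)

# Any Python UDF that receives the timestamp forces a per-row conversion
# to a Python datetime inside the forked worker, where localtime() is slow.
@udf(StringType())
def fmt(ts):
    return ts.isoformat()

df.select(fmt("event_time")).collect()
```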
We should work around this by resolving the timezone info once in Python and
passing it explicitly to the time conversion; the conversion cost returns to
normal after the fix. A sketch of the idea follows.
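A minimal sketch of the workaround idea, assuming the timezone can be resolved
once up front; `LOCAL_TZ` and `to_datetime` are hypothetical names, and the real
change would presumably land in PySpark's conversion path:

```python
from datetime import datetime

# Resolve the local timezone once, paying the localtime() cost a single
# time instead of on every per-row conversion.
LOCAL_TZ = datetime.now().astimezone().tzinfo

def to_datetime(ts: float) -> datetime:
    # With an explicit tzinfo, fromtimestamp() does not call glibc
    # localtime(), so the forked-process slow path is avoided.
    return datetime.fromtimestamp(ts, tz=LOCAL_TZ)
```

Note that `astimezone().tzinfo` yields a fixed UTC offset, which ignores DST
transitions across the data; a DST-aware alternative would be `zoneinfo.ZoneInfo`.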
!image-2025-11-10-11-17-07-698.png|width=582,height=292!