[
https://issues.apache.org/jira/browse/SPARK-54285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ruifeng Zheng resolved SPARK-54285.
-----------------------------------
Fix Version/s: 4.2.0
Resolution: Fixed
Issue resolved by pull request 52980
[https://github.com/apache/spark/pull/52980]
> Timestamp conversion is taking too long in Python UDF
> -----------------------------------------------------
>
> Key: SPARK-54285
> URL: https://issues.apache.org/jira/browse/SPARK-54285
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 4.1.0
> Reporter: Tian Gao
> Assignee: Tian Gao
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.2.0
>
> Attachments: image-2025-11-10-11-21-18-073.png
>
>
> Currently, timestamp conversion takes about 500us per value in the worker
> process, whereas it takes about 200ns in the daemon or other normal Python
> processes. This appears to be a known glibc issue: when a forked process
> calls `localtime()`, the timezone cache is effectively locked, so every call
> goes through the slow path.
> This can be reproduced with a timestamp column, e.g.
> `F.to_timestamp("event_time")`.
> We should work around this by resolving the timezone info in Python and
> passing it to the time conversion; the conversion cost returns to normal
> after the fix (a sketch of the idea follows below the screenshot).
> !image-2025-11-10-11-21-18-073.png|width=649,height=326!
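> As a minimal sketch of the workaround idea (not the actual change in the
> linked pull request): resolve the timezone once in Python and pass an
> explicit tzinfo into the conversion, so the forked worker does not hit
> glibc's `localtime()` per value. The names below (`convert_naive`,
> `convert_with_tz`, `LOCAL_TZ`) are illustrative only, and the fixed-offset
> tzinfo ignores DST transitions, which a real fix would need to handle.
> {code:python}
> import time
> from datetime import datetime, timezone
>
> # Resolve the local timezone once, up front (illustrative only).
> LOCAL_TZ = datetime.now(timezone.utc).astimezone().tzinfo
>
> def convert_naive(ts):
>     # Slow in a forked worker: each call goes through glibc localtime().
>     return datetime.fromtimestamp(ts)
>
> def convert_with_tz(ts):
>     # Workaround path: explicit tzinfo avoids the per-call localtime() lookup.
>     return datetime.fromtimestamp(ts, tz=LOCAL_TZ)
>
> if __name__ == "__main__":
>     ts = time.time()
>     for fn in (convert_naive, convert_with_tz):
>         start = time.perf_counter()
>         for _ in range(100_000):
>             fn(ts)
>         per_call_us = (time.perf_counter() - start) / 100_000 * 1e6
>         print(f"{fn.__name__}: {per_call_us:.2f} us/call")
> {code}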