[ https://issues.apache.org/jira/browse/SPARK-54285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng resolved SPARK-54285.
-----------------------------------
    Fix Version/s: 4.2.0
       Resolution: Fixed

Issue resolved by pull request 52980
[https://github.com/apache/spark/pull/52980]

> Timestamp conversion is taking too long in Python UDF
> -----------------------------------------------------
>
>                 Key: SPARK-54285
>                 URL: https://issues.apache.org/jira/browse/SPARK-54285
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 4.1.0
>            Reporter: Tian Gao
>            Assignee: Tian Gao
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.2.0
>
>         Attachments: image-2025-11-10-11-21-18-073.png
>
>
> Currently the timestamp conversion takes about 500us per value in the worker 
> process, whereas it takes about 200ns in the daemon or other normal Python 
> processes. This appears to be a known glibc issue: when a forked process 
> calls `localtime()`, the cache is somehow locked, so every call has to go 
> through the slow path.
> This can be reproduced with a timestamp column, e.g. 
> `F.to_timestamp("event_time")`.
> We should work around this by obtaining the timezone info in Python and 
> passing it in for the time conversion; the conversion cost then returns to 
> normal (see the sketch below).
> !image-2025-11-10-11-21-18-073.png|width=649,height=326!
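For context, below is a minimal sketch of the workaround idea described in the issue. It is an illustration only, not the actual change in the pull request above: the timezone name ("America/Los_Angeles") and the benchmark loop are placeholders, and the ~500us slow path only reproduces inside a forked worker process.

{code:python}
# Minimal sketch: resolve the timezone once in Python and pass it to every
# conversion, so per-value conversion no longer falls back to glibc localtime().
import time
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

# Assumption: in Spark the zone would come from the session timezone config;
# "America/Los_Angeles" is only a placeholder here.
LOCAL_TZ = ZoneInfo("America/Los_Angeles")


def convert_naive(epoch_us: int) -> datetime:
    # No tzinfo passed: each call goes through localtime(), which is the path
    # the issue reports as ~500us per value inside a forked PySpark worker.
    return datetime.fromtimestamp(epoch_us / 1e6)


def convert_with_tz(epoch_us: int) -> datetime:
    # Explicit tzinfo: UTC arithmetic plus a zoneinfo lookup, no glibc
    # localtime() involved, so the forked-process slow path is avoided.
    return datetime.fromtimestamp(epoch_us / 1e6, tz=timezone.utc).astimezone(LOCAL_TZ)


if __name__ == "__main__":
    now_us = int(time.time() * 1e6)
    for fn in (convert_naive, convert_with_tz):
        start = time.perf_counter()
        for _ in range(10_000):
            fn(now_us)
        elapsed_us = (time.perf_counter() - start) / 10_000 * 1e6
        print(f"{fn.__name__}: {elapsed_us:.2f} us per call")
{code}

Note that running this script in an ordinary (non-forked) interpreter will show both paths as fast; the gap only appears in a forked process, as described above.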



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
