Tian Gao created SPARK-54285:
--------------------------------

             Summary: Timestamp conversion is taking too long in Python UDF
                 Key: SPARK-54285
                 URL: https://issues.apache.org/jira/browse/SPARK-54285
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 4.1.0
            Reporter: Tian Gao


Currently, timestamp conversion takes about 500 µs per call in the worker process 
(versus roughly 200 ns in the daemon or other ordinary Python processes). This 
appears to be a known glibc issue: when a forked process calls `localtime()`, 
the timezone cache is effectively unavailable, so every call goes through the 
slow path.
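The gap can be measured directly. The sketch below (an illustration, not code from this issue) times `time.localtime()` in the parent process and in a freshly forked child; on affected glibc builds the child's per-call cost is dramatically higher:

```python
import os
import time

def localtime_cost(n=10_000):
    # Average per-call cost of time.localtime(), in seconds.
    start = time.perf_counter()
    for _ in range(n):
        time.localtime(0)
    return (time.perf_counter() - start) / n

parent_cost = localtime_cost()

# Fork a child and have it report its own measurement over a pipe,
# mimicking how a PySpark worker is forked from the daemon.
r, w = os.pipe()
pid = os.fork()
if pid == 0:
    os.close(r)
    os.write(w, repr(localtime_cost()).encode())
    os._exit(0)
os.close(w)
child_cost = float(os.read(r, 64))
os.waitpid(pid, 0)

print(f"parent: {parent_cost * 1e9:.0f} ns/call, child: {child_cost * 1e9:.0f} ns/call")
```

The script only reports the two averages; whether the child is slower depends on the glibc version and timezone configuration of the host.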

This can be reproduced with any timestamp column, e.g. `F.to_timestamp("event_time")`.

We should work around this by resolving the timezone info once in Python and 
passing it explicitly to the time conversion. With that fix, the conversion 
cost returns to normal.
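A minimal sketch of the proposed workaround (the function name and structure are illustrative, not the actual patch): resolve the local timezone once, then pass it as the `tz` argument so each conversion uses `tzinfo.utcoffset()` instead of a per-call C `localtime()` lookup.

```python
from datetime import datetime

# Resolve the local timezone a single time. This call may itself take the
# slow path once, but subsequent conversions reuse the cached tzinfo.
LOCAL_TZ = datetime.now().astimezone().tzinfo

def ts_to_datetime(epoch_seconds):
    # With an explicit tz, CPython computes the result from the tzinfo's
    # UTC offset rather than calling glibc localtime() on every invocation,
    # sidestepping the post-fork slowdown.
    return datetime.fromtimestamp(epoch_seconds, tz=LOCAL_TZ)
```

The result is timezone-aware; callers that expect the previous naive local datetime can drop the tzinfo with `.replace(tzinfo=None)`.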

!image-2025-11-10-11-17-07-698.png|width=582,height=292!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
