[
https://issues.apache.org/jira/browse/SPARK-57579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-57579:
-----------------------------------
Labels: pull-request-available (was: )
> [SQL][PYTHON] Add PySpark support for unix_nanos function
> ---------------------------------------------------------
>
> Key: SPARK-57579
> URL: https://issues.apache.org/jira/browse/SPARK-57579
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 4.3.0
> Reporter: Jubin Soni
> Priority: Major
> Labels: pull-request-available
>
> *Problem*
> The unix_nanos() SQL function and Scala API were added in SPARK-57527, but
> PySpark support was explicitly deferred as a tracked follow-up.
> The full family of epoch-unit functions exists in PySpark except for the
> nanosecond member:
> {code:java}
> - unix_seconds -> pyspark.sql.functions.unix_seconds (present)
> - unix_millis -> pyspark.sql.functions.unix_millis (present)
> - unix_micros -> pyspark.sql.functions.unix_micros (present)
> - unix_nanos -> pyspark.sql.functions.unix_nanos (MISSING)
> {code}
> The gap is acknowledged in the parity test:
> python/pyspark/sql/tests/test_functions.py, expected_missing_in_py set:
> "unix_nanos", # SPARK-57527: PySpark support tracked as a follow-up
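For readers unfamiliar with the parity test, the check can be sketched roughly as follows. This is a simplified illustration, not the code in test_functions.py: the helper name `missing_in_python` and the hard-coded name sets are hypothetical stand-ins for the real API introspection.

```python
from typing import Set


def missing_in_python(jvm_names: Set[str], py_names: Set[str]) -> Set[str]:
    """Return function names exposed on the JVM side but absent from Python."""
    return jvm_names - py_names


# Hypothetical stand-ins for the introspected function lists.
jvm_names = {"unix_seconds", "unix_millis", "unix_micros", "unix_nanos"}
py_names = {"unix_seconds", "unix_millis", "unix_micros"}

# Known gaps are listed explicitly so the parity check still passes.
expected_missing_in_py = {"unix_nanos"}
assert missing_in_python(jvm_names, py_names) == expected_missing_in_py

# Once unix_nanos is exported from Python, nothing is missing any more,
# so the same comparison only holds with the entry removed from the set.
py_names_after_fix = py_names | {"unix_nanos"}
assert missing_in_python(jvm_names, py_names_after_fix) == set()
```

Under this model, leaving the entry in place after the function is added would make the comparison fail, which is why the set has to be updated in the same change.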
> *How to Reproduce*
> {code:java}
> from pyspark.sql import functions as sf
> df = spark.sql(
> "SELECT TIMESTAMP_NTZ '2020-01-01 13:24:35.123456789' AS ts"
> )
> df.select(sf.unix_nanos("ts")).show()
> # AttributeError: module 'pyspark.sql.functions' has no attribute 'unix_nanos'
> {code}
> The SQL path works fine:
> {code:java}
> spark.sql("SELECT unix_nanos(TIMESTAMP_NTZ '2020-01-01 13:24:35.123456789')")
> # returns 1577885075123456789 as DECIMAL(21, 0) -- correct
> {code}
> *Expected:* sf.unix_nanos(col) is available and returns the same result as
> the SQL unix_nanos() function (DECIMAL(21,0) nanoseconds since epoch).
> *Actual:* AttributeError — function is not exposed in the PySpark API.
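As a cross-check on the expected value, the nanosecond count can be recomputed in plain Python without Spark. This is only a sketch: it assumes the TIMESTAMP_NTZ literal is interpreted as UTC wall-clock time, and `unix_nanos_reference` is a hypothetical helper name, not part of PySpark.

```python
from datetime import datetime, timezone
from decimal import Decimal

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def unix_nanos_reference(ts: str) -> Decimal:
    """Nanoseconds since the epoch for a 'YYYY-MM-DD HH:MM:SS[.fffffffff]' string.

    Assumption: the timestamp is treated as UTC wall-clock time.
    """
    whole, _, frac = ts.partition(".")
    # datetime only keeps microseconds, so the fraction is handled separately:
    # pad or truncate it to exactly nine digits (nanoseconds).
    frac_nanos = int(frac.ljust(9, "0")[:9]) if frac else 0
    dt = datetime.strptime(whole, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    delta = dt - EPOCH
    seconds = delta.days * 86_400 + delta.seconds
    # Exact integer arithmetic; Decimal mirrors the DECIMAL(21, 0) result type.
    return Decimal(seconds * 1_000_000_000 + frac_nanos)


print(unix_nanos_reference("2020-01-01 13:24:35.123456789"))
# 1577885075123456789 under the UTC assumption
```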
> *Work Needed*
> 1. python/pyspark/sql/functions/builtin.py
> Add unix_nanos() function after unix_micros (line ~11749), following the
> same pattern as unix_micros:
> {code:java}
> @_try_remote_functions
> def unix_nanos(col: "ColumnOrName") -> Column:
>     """Returns the number of nanoseconds since 1970-01-01 00:00:00 UTC
>     as DECIMAL(21, 0). Only supports TIMESTAMP_LTZ(p) and TIMESTAMP_NTZ(p)
>     with precision p in [7, 9].
>     ...
>     """
>     return _invoke_function_over_columns("unix_nanos", col)
> {code}
> 2. python/pyspark/sql/functions/__init__.py
> Export unix_nanos in __init__.py alongside unix_micros, unix_millis, and
> unix_seconds.
> 3. python/pyspark/sql/connect/functions/builtin.py
> Add the Connect-side wrapper for unix_nanos, following the same structure
> as unix_micros in that file.
> 4. python/pyspark/sql/tests/test_functions.py
> Remove "unix_nanos" from the expected_missing_in_py set (and its comment).
> 5. Add a doctest in the unix_nanos docstring covering:
> - A nanosecond-precision TIMESTAMP_NTZ input
> - A NULL input (returns NULL)
> The doctest should follow the style of the unix_micros doctest (lines
> 11735-11747).