[
https://issues.apache.org/jira/browse/SPARK-57579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jubin Soni updated SPARK-57579:
-------------------------------
Description:
*Problem*
The unix_nanos() SQL function and Scala API were added in SPARK-57527, but
PySpark support was explicitly deferred as a tracked follow-up.
The full family of epoch-unit functions exists in PySpark except for the
nanosecond member:
{code:java}
- unix_seconds -> pyspark.sql.functions.unix_seconds (present)
- unix_millis -> pyspark.sql.functions.unix_millis (present)
- unix_micros -> pyspark.sql.functions.unix_micros (present)
- unix_nanos -> pyspark.sql.functions.unix_nanos (MISSING)
{code}
The gap is acknowledged in the parity test:
python/pyspark/sql/tests/test_functions.py, expected_missing_in_py set:
"unix_nanos", # SPARK-57527: PySpark support tracked as a follow-up
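The role of the expected_missing_in_py set can be illustrated with a simplified sketch. This is not the code in test_functions.py: the helper name parity_report and the hard-coded function sets are hypothetical, and the sketch assumes the real check compares the observed gap against the allow-list for equality.

```python
def parity_report(jvm_functions, py_functions, expected_missing_in_py):
    """Compare two function-name sets against an allow-list of known gaps.

    Returns (unexpected, stale):
      unexpected: functions missing from Python that are not allow-listed
      stale:      allow-listed names that are no longer missing
    """
    missing = set(jvm_functions) - set(py_functions)
    unexpected = missing - set(expected_missing_in_py)
    stale = set(expected_missing_in_py) - missing
    return unexpected, stale


# Hypothetical, heavily reduced function lists for illustration only.
jvm = {"unix_seconds", "unix_millis", "unix_micros", "unix_nanos"}
py_before = {"unix_seconds", "unix_millis", "unix_micros"}
py_after = py_before | {"unix_nanos"}

# Before the change: the gap exists but is allow-listed, so nothing is flagged.
print(parity_report(jvm, py_before, {"unix_nanos"}))

# After the change: the allow-list entry is stale and has to be removed.
print(parity_report(jvm, py_after, {"unix_nanos"}))
```

Under that assumption, adding the Python function without also trimming the allow-list would leave a stale entry behind, which is why the test file is part of this change.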
*How to Reproduce*
{code:java}
from pyspark.sql import functions as sf
df = spark.sql(
    "SELECT TIMESTAMP_NTZ '2020-01-01 13:24:35.123456789' AS ts"
)
df.select(sf.unix_nanos("ts")).show()
# AttributeError: module 'pyspark.sql.functions' has no attribute 'unix_nanos'
{code}
The SQL path works fine:
{code:java}
spark.sql("SELECT unix_nanos(TIMESTAMP_NTZ '2020-01-01 13:24:35.123456789')")
# returns 1577885075123456789 as DECIMAL(21, 0) -- correct
{code}
*Expected:* sf.unix_nanos(col) is available and returns the same result as the
SQL unix_nanos() function (DECIMAL(21,0) nanoseconds since epoch).
*Actual:* AttributeError; the function is not exposed in the PySpark API.
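As a Spark-independent cross-check of the expected value, the arithmetic can be sketched in plain Python. The helper name unix_nanos_reference is hypothetical, and the sketch assumes the TIMESTAMP_NTZ wall-clock value is interpreted as UTC when converting to an epoch offset. Decimal is used because Python's datetime only carries microsecond precision.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal
from typing import Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def unix_nanos_reference(ts: Optional[str]) -> Optional[Decimal]:
    """Nanoseconds since the epoch for 'YYYY-MM-DD HH:MM:SS[.fffffffff]'.

    Assumption: the wall-clock string is read as UTC. None maps to None,
    mirroring SQL NULL propagation.
    """
    if ts is None:
        return None
    whole, _, frac = ts.partition(".")
    wall_clock = datetime.strptime(whole, "%Y-%m-%d %H:%M:%S").replace(
        tzinfo=timezone.utc
    )
    # Whole seconds as an exact integer (no float rounding).
    seconds = (wall_clock - EPOCH) // timedelta(seconds=1)
    # Pad or truncate the fractional part to exactly nine digits.
    nanos = int(frac.ljust(9, "0")[:9]) if frac else 0
    return Decimal(seconds) * 1_000_000_000 + nanos


print(unix_nanos_reference("2020-01-01 13:24:35.123456789"))
# 1577885075123456789
```

The result has 19 digits, so it fits comfortably in DECIMAL(21, 0).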
*Work Needed*
1. python/pyspark/sql/functions/builtin.py
Add unix_nanos() function after unix_micros (line ~11749), following the
same pattern as unix_micros:
{code:java}
@_try_remote_functions
def unix_nanos(col: "ColumnOrName") -> Column:
    """Returns the number of nanoseconds since 1970-01-01 00:00:00 UTC
    as DECIMAL(21, 0). Only supports TIMESTAMP_LTZ(p) and TIMESTAMP_NTZ(p)
    with precision p in [7, 9].
    ...
    """
    return _invoke_function_over_columns("unix_nanos", col)
{code}
2. python/pyspark/sql/functions/__init__.py
Export unix_nanos in __init__.py alongside unix_micros, unix_millis, and
unix_seconds.
3. python/pyspark/sql/connect/functions/builtin.py
Add the Connect-side wrapper for unix_nanos, following the same structure as
unix_micros in that file.
4. python/pyspark/sql/tests/test_functions.py
Remove "unix_nanos" from the expected_missing_in_py set (and its comment).
5. Add a doctest in the unix_nanos docstring covering:
- A nanosecond-precision TIMESTAMP_NTZ input
- A NULL input (returns NULL)
following the style of unix_micros (lines 11735-11747).
> [SQL][PYTHON] Add PySpark support for unix_nanos function
> ---------------------------------------------------------
>
> Key: SPARK-57579
> URL: https://issues.apache.org/jira/browse/SPARK-57579
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 4.3.0
> Reporter: Jubin Soni
> Priority: Major