[
https://issues.apache.org/jira/browse/SPARK-40835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rohan Barman updated SPARK-40835:
---------------------------------
Description:
We are in the process of migrating our PySpark applications from Spark version
3.1.2 to Spark version 3.2.0.
This bug is present in version 3.2.0. We do not see this issue in version 3.1.2.
*Minimal example to reproduce bug*
Below is a minimal example of applying to_utc_timestamp() to a string column
that contains timestamp data.
{code:python}
from pyspark.sql import SparkSession

# Reuse the active session (e.g. in the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()

# Source data: timestamps stored as strings with a +0000 offset
columns = ["id", "timestamp_field"]
data = [("1", "2022-10-17T00:00:00+0000"), ("2", "2022-10-17T00:00:00+0000")]
source_df = spark.createDataFrame(data).toDF(*columns)
source_df.createOrReplaceTempView("source")
print("Source:")
source_df.show()  # show() prints the table itself and returns None

# Execute query
query = """
SELECT
    id,
    timestamp_field AS original,
    to_utc_timestamp(timestamp_field, 'UTC') AS received_timestamp
FROM source
"""
df = spark.sql(query)
print("Transformed:")
df.show()
print(df.count())
{code}
*Post Execution*
The source data has a column called _timestamp_field_, which is of string type.
{code:java}
Source:
+---+--------------------+
| id|     timestamp_field|
+---+--------------------+
|  1|2022-10-17T00:00:...|
|  2|2022-10-17T00:00:...|
+---+--------------------+
{code}
The query applies to_utc_timestamp() to timestamp_field to create a new column,
received_timestamp. Every value in the new column is null.
{code:java}
Transformed:
+---+--------------------+------------------+
| id|            original|received_timestamp|
+---+--------------------+------------------+
|  1|2022-10-16T00:00:...|              null|
|  2|2022-10-16T00:00:...|              null|
+---+--------------------+------------------+
{code}
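As a point of comparison (not part of the original report), the non-null value one would expect for these inputs can be computed outside Spark with the Python standard library. This is only a sketch: the helper name `expected_utc` is hypothetical, and it does not reproduce Spark's parsing rules. It merely shows that the sample string is a well-formed timestamp at midnight UTC.

```python
from datetime import datetime, timezone


def expected_utc(value: str) -> datetime:
    """Parse a timestamp string with a +HHMM offset and normalise it to UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")
    return parsed.astimezone(timezone.utc)


# Same sample values as in the reproduction above
for raw in ["2022-10-17T00:00:00+0000", "2022-10-17T00:00:00+0000"]:
    print(raw, "->", expected_utc(raw).isoformat())
```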
*Questions*
* Did the behavior of to_utc_timestamp change in Spark 3.2.0? We do not see
this issue in Spark 3.1.2.
* Is there a Spark setting we can apply to resolve this?
* Is there a new preferred function in Spark 3.2.0 that replaces
to_utc_timestamp?
> to_utc_timestamp creates null column
> ------------------------------------
>
> Key: SPARK-40835
> URL: https://issues.apache.org/jira/browse/SPARK-40835
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.2.0
> Reporter: Rohan Barman
> Priority: Major
>