[ 
https://issues.apache.org/jira/browse/SPARK-44717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44717.
----------------------------------
    Fix Version/s: 3.5.0
                   4.0.0
       Resolution: Fixed

Issue resolved by pull request 42392
[https://github.com/apache/spark/pull/42392]

> "pyspark.pandas.resample" is incorrect when DST is overlapped and setting 
> "spark.sql.timestampType" to TIMESTAMP_NTZ does not help
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-44717
>                 URL: https://issues.apache.org/jira/browse/SPARK-44717
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark
>    Affects Versions: 3.4.0, 3.4.1, 4.0.0
>            Reporter: Attila Zsolt Piros
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 3.5.0, 4.0.0
>
>
> Use one of the existing tests:
> - the "11H" case of test_dataframe_resample (pyspark.pandas.tests.test_resample.ResampleTests)
> - the "1001H" case of test_series_resample (pyspark.pandas.tests.test_resample.ResampleTests)
> after setting the TZ to, for example, New York (e.g. with the following 
> Python code in a "setUpClass"):
> {noformat}
> os.environ["TZ"] = 'America/New_York'
> {noformat}
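As a side note, such a fixture can be sketched as below (the class name is illustrative, not the actual Spark test code; on POSIX systems, time.tzset() is needed for the TZ change to take effect in the running process):

```python
import os
import time
import unittest


class ResampleDSTTest(unittest.TestCase):
    """Illustrative fixture: run the whole test class in the America/New_York zone."""

    @classmethod
    def setUpClass(cls):
        cls._old_tz = os.environ.get("TZ")
        os.environ["TZ"] = "America/New_York"
        time.tzset()  # POSIX only: re-read TZ so local-time conversions pick it up

    @classmethod
    def tearDownClass(cls):
        # Restore the original zone so other tests are unaffected.
        if cls._old_tz is None:
            os.environ.pop("TZ", None)
        else:
            os.environ["TZ"] = cls._old_tz
        time.tzset()
```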
> You will get the following error for the latter test:
> {noformat}
> ======================================================================
> FAIL [4.219s]: test_series_resample (pyspark.pandas.tests.test_resample.ResampleTests)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 276, in test_series_resample
>     self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", "right", "sum")
>   File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 259, in _test_resample
>     self.assert_eq(
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in assert_eq
>     _assert_pandas_almost_equal(lobj, robj)
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in _assert_pandas_almost_equal
>     raise PySparkAssertionError(
> pyspark.errors.exceptions.base.PySparkAssertionError: [DIFFERENT_PANDAS_SERIES] Series are not almost equal:
> Left:
> Freq: 1001H
> float64
> Right:
> float64
> {noformat}
> The problem is that the pyspark resample produces extra resampled rows in 
> the result. The DST change causes those extra rows, as the computed 
> __tmp_resample_bin_col__ looks something like:
> {noformat}
> | __index_level_0__  | __tmp_resample_bin_col__ | A                  |
> .....
> |2011-03-08 00:00:00|2011-03-26 11:00:00     |0.3980551570183919  |
> |2011-03-09 00:00:00|2011-03-26 11:00:00     |0.6511376673995046  |
> |2011-03-10 00:00:00|2011-03-26 11:00:00     |0.6141085426890365  |
> |2011-03-11 00:00:00|2011-03-26 11:00:00     |0.11557638066163867 |
> |2011-03-12 00:00:00|2011-03-26 11:00:00     |0.4517788243490799  |
> |2011-03-13 00:00:00|2011-03-26 11:00:00     |0.8637060550157284  |
> |2011-03-14 00:00:00|2011-03-26 10:00:00     |0.8169499149450166  |
> |2011-03-15 00:00:00|2011-03-26 10:00:00     |0.4585916249356583  |
> |2011-03-16 00:00:00|2011-03-26 10:00:00     |0.8362472880832088  |
> |2011-03-17 00:00:00|2011-03-26 10:00:00     |0.026716901748386812|
> |2011-03-18 00:00:00|2011-03-26 10:00:00     |0.9086816462089563  |
> {noformat}
> You can see the extra rows around 2011-03-13, when DST kicked in in 
> New York.
> Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not 
> help.
> You can see my tests here:
> https://github.com/attilapiros/spark/pull/5
> Pandas timestamps are timezone-naive:
> {noformat}
> import pandas as pd
> a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
> b = pd.Timedelta(hours=1)
> >>> a
> Timestamp('2011-03-13 01:00:00')
> >>> a+b
> Timestamp('2011-03-13 02:00:00')
> >>> a+b+b
> Timestamp('2011-03-13 03:00:00')
> {noformat}
> But the pyspark TimestampType is timezone-aware and honors DST:
> {noformat}
> >>> sql("select  TIMESTAMP '2011-03-13 01:00:00'").show()
> +-------------------------------+
> |TIMESTAMP '2011-03-13 01:00:00'|
> +-------------------------------+
> |            2011-03-13 01:00:00|
> +-------------------------------+
> >>> sql("select  TIMESTAMP '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)").show()
> +--------------------------------------------------------------------+
> |TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
> +--------------------------------------------------------------------+
> |                                                 2011-03-13 03:00:00|
> +--------------------------------------------------------------------+
> {noformat}
> The current resample code uses the interval-based calculation shown above.
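The mismatch is easy to reproduce in plain Python with the standard-library zoneinfo module (Python 3.9+), independent of Spark and pandas; this sketch mimics the two arithmetic styles, it is not the actual resample code:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

ny = ZoneInfo("America/New_York")

# Wall-clock arithmetic on a naive timestamp, as pandas does:
naive = datetime(2011, 3, 13, 1, 0)
print(naive + timedelta(hours=1))  # 2011-03-13 02:00:00

# Instant-based arithmetic in America/New_York, analogous to Spark's
# session-timezone-aware TIMESTAMP: add one *elapsed* hour via UTC.
aware = datetime(2011, 3, 13, 1, 0, tzinfo=ny)
later = (aware.astimezone(timezone.utc) + timedelta(hours=1)).astimezone(ny)
print(later)  # 2011-03-13 03:00:00-04:00 -- the 02:00 wall hour was skipped by DST
```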



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
