GitHub user 0x0FFF opened a pull request:

    https://github.com/apache/spark/pull/8555

    [SPARK-10162] [SQL] Fix timezone being omitted in the PySpark DataFrame filter function

    This PR addresses [SPARK-10162](https://issues.apache.org/jira/browse/SPARK-10162).
    The issue is with the DataFrame filter() function when a datetime.datetime is passed to it:
    * The timezone information of the datetime is ignored
    * The datetime is assumed to be in the local timezone, which depends on the OS timezone setting
    
    The fix includes both a code change and a regression test. Reproduction code on master:
    ```python
    import pytz
    from datetime import datetime
    from pyspark.sql import *
    from pyspark.sql.types import *
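    # this snippet assumes the pyspark shell, where sc (the SparkContext) is already defined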
    sqc = SQLContext(sc)
    df = sqc.createDataFrame([], StructType([StructField("dt", TimestampType())]))
    
    m1 = pytz.timezone('UTC')
    m2 = pytz.timezone('Etc/GMT+3')
    
    df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m1)).explain()
    df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m2)).explain()
    ```
    Both filters produce the same timestamp literal, ignoring the time zone:
    ```
    >>> df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m1)).explain()
    Filter (dt#0 > 946713600000000)
     Scan PhysicalRDD[dt#0]
    
    >>> df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m2)).explain()
    Filter (dt#0 > 946713600000000)
     Scan PhysicalRDD[dt#0]
    ```
    After the fix:
    ```
    >>> df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m1)).explain()
    Filter (dt#0 > 946684800000000)
     Scan PhysicalRDD[dt#0]
    
    >>> df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m2)).explain()
    Filter (dt#0 > 946695600000000)
     Scan PhysicalRDD[dt#0]
    ```
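    For reference, the corrected literals can be reproduced outside Spark by converting the same timezone-aware datetimes to microseconds since the Unix epoch (a minimal standalone sketch for illustration, not code from this PR):
    ```python
    import calendar
    import pytz
    from datetime import datetime

    for name in ('UTC', 'Etc/GMT+3'):           # note: Etc/GMT+3 means UTC-3
        tz = pytz.timezone(name)
        dt = tz.localize(datetime(2000, 1, 1))  # attach the zone (pytz-recommended way)
        micros = calendar.timegm(dt.utctimetuple()) * 1000000  # UTC-based epoch microseconds
        print(name, micros)
    # UTC        946684800000000
    # Etc/GMT+3  946695600000000
    ```
    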
    PR [8536](https://github.com/apache/spark/pull/8536) was accidentally closed by me when I dropped the repo.
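    
    The general idea behind the code change, as a hedged sketch (not the actual diff from this PR; the helper name is hypothetical): when the datetime carries tzinfo, convert it through UTC instead of interpreting it with time.mktime() in the local timezone.
    ```python
    import time
    import calendar

    def to_epoch_seconds(dt):
        """Hypothetical helper: datetime -> whole seconds since the Unix epoch."""
        if dt.tzinfo is not None:
            # Timezone-aware: utctimetuple() shifts to UTC, timegm() is UTC-based
            return calendar.timegm(dt.utctimetuple())
        # Naive datetime: keep the old behavior of assuming the OS local timezone
        return int(time.mktime(dt.timetuple()))
    ```
    (Microseconds on the datetime would be added on top of this value.)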

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/0x0FFF/spark SPARK-10162

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8555.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8555
    
----
commit 610cb3f3cb5713e9e733ccc36fdc197eae7f4fe5
Author: 0x0FFF <[email protected]>
Date:   2015-09-01T12:30:09Z

    [SPARK-10162] [SQL] Fix timezone being omitted in the PySpark DataFrame filter function

----

