[ 
https://issues.apache.org/jira/browse/ARROW-9866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186387#comment-17186387
 ] 

Joris Van den Bossche commented on ARROW-9866:
----------------------------------------------

This comes down to timezone handling: how pandas Timestamp values, which may 
carry a timezone, are converted to a pyarrow scalar for the comparison.

Using your example timestamps, but only looking at the conversion and equality 
check for a moment:

{code:python}
import itertools
import pandas
import pyarrow
import pyarrow.compute
import pytz


for left, right in itertools.product(
    [
        pandas.Timestamp("2000-01-01 00:00:00"),
        pandas.Timestamp("2000-01-01 00:00:00", tz="UTC"),
        pandas.Timestamp("2000-01-01 00:00:00", tz="US/Eastern"),
        pandas.Timestamp("1999-12-31 19:00:00", tz=pytz.FixedOffset(-300)),
    ],
    repeat=2,
):
    typ = pyarrow.array(pandas.Series([left])).type
    scalar_left = pyarrow.scalar(left, type=typ)
    scalar_right = pyarrow.scalar(right, type=typ)
    equal = pyarrow.compute.equal(scalar_left, scalar_right)
    print(f"Left : {left} -> {scalar_left} ({scalar_left.type})")
    print(f"Right: {right} -> {scalar_right} ({scalar_right.type})")
    print(f"Equal: {equal}\n")
{code}

(I am converting the right timestamp (the one you are filtering with) to the 
type of the left, as that is also what happens when filtering with the 
`filters=` expression.)

With pyarrow 1.0, this gives:

{code}
Left : 2000-01-01 00:00:00 -> 2000-01-01 00:00:00 (timestamp[ns])
Right: 2000-01-01 00:00:00 -> 2000-01-01 00:00:00 (timestamp[ns])
Equal: True

Left : 2000-01-01 00:00:00 -> 2000-01-01 00:00:00 (timestamp[ns])
Right: 2000-01-01 00:00:00+00:00 -> 2000-01-01 00:00:00 (timestamp[ns])
Equal: True

Left : 2000-01-01 00:00:00 -> 2000-01-01 00:00:00 (timestamp[ns])
Right: 2000-01-01 00:00:00-05:00 -> 2000-01-01 00:00:00 (timestamp[ns])
Equal: True

Left : 2000-01-01 00:00:00 -> 2000-01-01 00:00:00 (timestamp[ns])
Right: 1999-12-31 19:00:00-05:00 -> 1999-12-31 19:00:00 (timestamp[ns])
Equal: False

Left : 2000-01-01 00:00:00+00:00 -> 2000-01-01 00:00:00+00:00 (timestamp[ns, tz=UTC])
Right: 2000-01-01 00:00:00 -> 2000-01-01 00:00:00+00:00 (timestamp[ns, tz=UTC])
Equal: True

Left : 2000-01-01 00:00:00+00:00 -> 2000-01-01 00:00:00+00:00 (timestamp[ns, tz=UTC])
Right: 2000-01-01 00:00:00+00:00 -> 2000-01-01 00:00:00+00:00 (timestamp[ns, tz=UTC])
Equal: True

Left : 2000-01-01 00:00:00+00:00 -> 2000-01-01 00:00:00+00:00 (timestamp[ns, tz=UTC])
Right: 2000-01-01 00:00:00-05:00 -> 2000-01-01 00:00:00+00:00 (timestamp[ns, tz=UTC])
Equal: True

Left : 2000-01-01 00:00:00+00:00 -> 2000-01-01 00:00:00+00:00 (timestamp[ns, tz=UTC])
Right: 1999-12-31 19:00:00-05:00 -> 1999-12-31 19:00:00+00:00 (timestamp[ns, tz=UTC])
Equal: False

Left : 2000-01-01 00:00:00-05:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=US/Eastern])
Right: 2000-01-01 00:00:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=US/Eastern])
Equal: True

Left : 2000-01-01 00:00:00-05:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=US/Eastern])
Right: 2000-01-01 00:00:00+00:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=US/Eastern])
Equal: True

Left : 2000-01-01 00:00:00-05:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=US/Eastern])
Right: 2000-01-01 00:00:00-05:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=US/Eastern])
Equal: True

Left : 2000-01-01 00:00:00-05:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=US/Eastern])
Right: 1999-12-31 19:00:00-05:00 -> 1999-12-31 14:00:00-05:00 (timestamp[ns, tz=US/Eastern])
Equal: False

Left : 1999-12-31 19:00:00-05:00 -> 1999-12-31 14:00:00-05:00 (timestamp[ns, tz=-05:00])
Right: 2000-01-01 00:00:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=-05:00])
Equal: False

Left : 1999-12-31 19:00:00-05:00 -> 1999-12-31 14:00:00-05:00 (timestamp[ns, tz=-05:00])
Right: 2000-01-01 00:00:00+00:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=-05:00])
Equal: False

Left : 1999-12-31 19:00:00-05:00 -> 1999-12-31 14:00:00-05:00 (timestamp[ns, tz=-05:00])
Right: 2000-01-01 00:00:00-05:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=-05:00])
Equal: False

Left : 1999-12-31 19:00:00-05:00 -> 1999-12-31 14:00:00-05:00 (timestamp[ns, tz=-05:00])
Right: 1999-12-31 19:00:00-05:00 -> 1999-12-31 14:00:00-05:00 (timestamp[ns, tz=-05:00])
Equal: True
{code}

However, the way timezone-aware timestamps are converted to pyarrow was 
recently changed and fixed (https://github.com/apache/arrow/pull/7816, 
ARROW-9528). Running the same script with master gives:

{code}
Left : 2000-01-01 00:00:00 -> 2000-01-01 00:00:00 (timestamp[ns])
Right: 2000-01-01 00:00:00 -> 2000-01-01 00:00:00 (timestamp[ns])
Equal: True

Left : 2000-01-01 00:00:00 -> 2000-01-01 00:00:00 (timestamp[ns])
Right: 2000-01-01 00:00:00+00:00 -> 2000-01-01 00:00:00 (timestamp[ns])
Equal: True

Left : 2000-01-01 00:00:00 -> 2000-01-01 00:00:00 (timestamp[ns])
Right: 2000-01-01 00:00:00-05:00 -> 2000-01-01 05:00:00 (timestamp[ns])
Equal: False

Left : 2000-01-01 00:00:00 -> 2000-01-01 00:00:00 (timestamp[ns])
Right: 1999-12-31 19:00:00-05:00 -> 2000-01-01 00:00:00 (timestamp[ns])
Equal: True

Left : 2000-01-01 00:00:00+00:00 -> 2000-01-01 00:00:00+00:00 (timestamp[ns, tz=UTC])
Right: 2000-01-01 00:00:00 -> 2000-01-01 00:00:00+00:00 (timestamp[ns, tz=UTC])
Equal: True

Left : 2000-01-01 00:00:00+00:00 -> 2000-01-01 00:00:00+00:00 (timestamp[ns, tz=UTC])
Right: 2000-01-01 00:00:00+00:00 -> 2000-01-01 00:00:00+00:00 (timestamp[ns, tz=UTC])
Equal: True

Left : 2000-01-01 00:00:00+00:00 -> 2000-01-01 00:00:00+00:00 (timestamp[ns, tz=UTC])
Right: 2000-01-01 00:00:00-05:00 -> 2000-01-01 05:00:00+00:00 (timestamp[ns, tz=UTC])
Equal: False

Left : 2000-01-01 00:00:00+00:00 -> 2000-01-01 00:00:00+00:00 (timestamp[ns, tz=UTC])
Right: 1999-12-31 19:00:00-05:00 -> 2000-01-01 00:00:00+00:00 (timestamp[ns, tz=UTC])
Equal: True

Left : 2000-01-01 00:00:00-05:00 -> 2000-01-01 00:00:00-05:00 (timestamp[ns, tz=US/Eastern])
Right: 2000-01-01 00:00:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=US/Eastern])
Equal: False

Left : 2000-01-01 00:00:00-05:00 -> 2000-01-01 00:00:00-05:00 (timestamp[ns, tz=US/Eastern])
Right: 2000-01-01 00:00:00+00:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=US/Eastern])
Equal: False

Left : 2000-01-01 00:00:00-05:00 -> 2000-01-01 00:00:00-05:00 (timestamp[ns, tz=US/Eastern])
Right: 2000-01-01 00:00:00-05:00 -> 2000-01-01 00:00:00-05:00 (timestamp[ns, tz=US/Eastern])
Equal: True

Left : 2000-01-01 00:00:00-05:00 -> 2000-01-01 00:00:00-05:00 (timestamp[ns, tz=US/Eastern])
Right: 1999-12-31 19:00:00-05:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=US/Eastern])
Equal: False

Left : 1999-12-31 19:00:00-05:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=-05:00])
Right: 2000-01-01 00:00:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=-05:00])
Equal: True

Left : 1999-12-31 19:00:00-05:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=-05:00])
Right: 2000-01-01 00:00:00+00:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=-05:00])
Equal: True

Left : 1999-12-31 19:00:00-05:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=-05:00])
Right: 2000-01-01 00:00:00-05:00 -> 2000-01-01 00:00:00-05:00 (timestamp[ns, tz=-05:00])
Equal: False

Left : 1999-12-31 19:00:00-05:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=-05:00])
Right: 1999-12-31 19:00:00-05:00 -> 1999-12-31 19:00:00-05:00 (timestamp[ns, tz=-05:00])
Equal: True
{code}

The above now seems correct to me.
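
To make the difference between the two outputs concrete, here is a minimal 
sketch using only the standard library. It is a model of what the printed 
values suggest, not pyarrow's actual conversion code: the helper names 
`wall_clock_as_utc` and `utc_instant` are hypothetical, the fixed -05:00 offset 
stands in for US/Eastern in January, and treating naive values as UTC is an 
assumption read off the master output.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_MICROSECOND = timedelta(microseconds=1)


def wall_clock_as_utc(ts: datetime) -> int:
    """Hypothetical model of the 1.0 output: ignore the offset, keep the wall clock."""
    return (ts.replace(tzinfo=timezone.utc) - EPOCH) // ONE_MICROSECOND


def utc_instant(ts: datetime) -> int:
    """Hypothetical model of the master output: normalize aware values to UTC.

    Naive values are assumed to be in UTC already.
    """
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    return (ts - EPOCH) // ONE_MICROSECOND


# Fixed -05:00 offset, standing in for US/Eastern in January.
minus_five = timezone(timedelta(hours=-5))

naive = datetime(2000, 1, 1)
eastern_midnight = datetime(2000, 1, 1, tzinfo=minus_five)
offset_evening = datetime(1999, 12, 31, 19, tzinfo=minus_five)

# Compare the naive value against both aware values under each model.
old_midnight_equal = wall_clock_as_utc(naive) == wall_clock_as_utc(eastern_midnight)
old_evening_equal = wall_clock_as_utc(naive) == wall_clock_as_utc(offset_evening)
new_midnight_equal = utc_instant(naive) == utc_instant(eastern_midnight)
new_evening_equal = utc_instant(naive) == utc_instant(offset_evening)

print(old_midnight_equal, old_evening_equal)
print(new_midnight_equal, new_evening_equal)
```

Under these assumptions the first model reproduces the pyarrow 1.0 results for 
a naive left-hand side (equal to midnight -05:00, not equal to 19:00 -05:00), 
and the second reproduces the master results (the reverse), which is the 
behaviour you would expect when comparing actual points in time.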

> [Python] Incorrect timestamp column filtering
> ---------------------------------------------
>
>                 Key: ARROW-9866
>                 URL: https://issues.apache.org/jira/browse/ARROW-9866
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.0
>            Reporter: Josh
>            Priority: Minor
>
> Here are some sample test cases:
>  
> {code:python}
> import io
> import itertools
> import pandas
> import pyarrow
> import pyarrow.dataset
> import pyarrow.parquet
> import pytest
> import pytz
> @pytest.mark.parametrize(
>     "data_date, filter_date",
>     itertools.product(
>         [
>             pandas.Timestamp("2000-01-01 00:00:00"),
>             pandas.Timestamp("2000-01-01 00:00:00", tz="UTC"),
>             pandas.Timestamp("2000-01-01 00:00:00", tz="US/Eastern"),
>             pandas.Timestamp("1999-12-31 19:00:00", tz=pytz.FixedOffset(-300)),
>         ],
>         repeat=2,
>     ),
>     ids=lambda x: x.isoformat(),
> )
> def test_timestsamp_filter(data_date, filter_date):
>     data_date = pandas.Timestamp(data_date)
>     filter_date = pandas.Timestamp(filter_date)
>     df = pandas.DataFrame(dict(date=[data_date]))
>     try:
>         if data_date == filter_date:
>             expected = df
>         else:
>             # empty frame
>             expected = df.iloc[:0, :]
>     except TypeError:
>         # empty frame
>         expected = df.iloc[:0, :]
>     fileobj = io.BytesIO()
>     pyarrow.parquet.write_table(pyarrow.Table.from_pandas(df), fileobj)
>     actual = pyarrow.parquet.read_table(
>         fileobj, filters=pyarrow.dataset.field("date") == filter_date
>     ).to_pandas()
>     pandas.testing.assert_frame_equal(actual, expected)
> {code}
>  Pytest summary:
> {noformat}
> =========================== short test summary info ============================
> FAILED test_arrow.py::test_timestsamp_filter[2000-01-01T00:00:00-2000-01-01T00:00:00+00:00]
> FAILED test_arrow.py::test_timestsamp_filter[2000-01-01T00:00:00-2000-01-01T00:00:00-05:00]
> FAILED test_arrow.py::test_timestsamp_filter[2000-01-01T00:00:00+00:00-2000-01-01T00:00:00]
> FAILED test_arrow.py::test_timestsamp_filter[2000-01-01T00:00:00+00:00-2000-01-01T00:00:00-05:00]
> FAILED test_arrow.py::test_timestsamp_filter[2000-01-01T00:00:00+00:00-1999-12-31T19:00:00-05:00]
> FAILED test_arrow.py::test_timestsamp_filter[2000-01-01T00:00:00-05:00-2000-01-01T00:00:00-05:00]
> FAILED test_arrow.py::test_timestsamp_filter[1999-12-31T19:00:00-05:00-2000-01-01T00:00:00]
> FAILED test_arrow.py::test_timestsamp_filter[1999-12-31T19:00:00-05:00-2000-01-01T00:00:00-05:00]
> FAILED test_arrow.py::test_timestsamp_filter[1999-12-31T19:00:00-05:00-1999-12-31T19:00:00-05:00]
> ========================= 9 failed, 7 passed in 0.23s ==========================
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
