gaogaotiantian commented on PR #53161:
URL: https://github.com/apache/spark/pull/53161#issuecomment-3582932277
@cloud-fan , I understand `TimestampType` under the hood is just a UTC epoch
timestamp. We need to convert to UTC timestamp so we have to assume a timezone
for naive timestamps - I don't believe we are doing that.
```python
spark.conf.set("spark.sql.session.timeZone", "UTC")
df = spark.createDataFrame([
(datetime.datetime(1990, 8, 10, 0, 0, tzinfo=datetime.timezone.utc),),
(datetime.datetime(1990, 8, 10, 0, 0),)
], ["ts"])
df.show()
```
The two columns above are different - because we do not respect session
timezone when converting them. Notice that UDF is not involved at this point.
https://github.com/apache/spark/blob/0908ec5272e7b790761f64b46553c4567ad96d01/python/pyspark/sql/types.py#L441
We don't check timezone info in `toInternal` and `fromInternal`. (I don't
know if there's other secrets like changing the system timezone, but the result
is different).
We can fix that with some hacks, if we really want to - that's the option 1
I mentioned above. Again, we need to do it *everywhere*.
However, that is not the full picture. We have an even worse issue about
`datetime.datetime` - yes, internally we can convert it to an EPOCH timestamp,
but the user might want to play with in in Python.
```python
@udf(returnType=BooleanType())
def greater(ts):
return ts > datetime.datetime(1990, 8, 10, 0, 0,
tzinfo=datetime.timezone.utc)
```
The code above will raise an error, because we convert the `TimestampType`
to a naive datetime - even though we claim that `TimestampType` is timezone
aware. It's illegal to compare a naive timestamp with an aware timestamp in
Python (you can do `==` check but it will always return `False`).
Also I found a issue with probably DST.
```python
@udf(returnType=BooleanType())
def same(ts):
return ts == datetime.datetime(1990, 8, 10, 0, 0)
df = spark.createDataFrame([
(datetime.datetime(1990, 8, 10, 0, 0),)
], ["ts"])
df.select(same("ts")).show()
```
Even this returns `False` - there's an hour diff, probably due to some
missing DST checks.
Back to my point - our `TimestampType` on Python is just broken - it will
disappoint users when they try to do some manipulation on it. We can't mix
naive and aware timestamps together because Python does not support it.
This is why I propose my second option - it's a bit aggressive but we can
make it right - to always map `TimestampType` with aware `datetime` and
`TimestampNTZType` with naive `datetime`. I believe that's the only chance that
we can make it completely correct.
However, there is a risk that some of the existing user code can break. If
that's a concern. We can just leave this broken. It still works in some
occasions.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]