gaogaotiantian commented on PR #53161:
URL: https://github.com/apache/spark/pull/53161#issuecomment-3582932277

   @cloud-fan , I understand that `TimestampType` under the hood is just a UTC epoch timestamp. Since we need to convert to a UTC timestamp, we have to assume a time zone for naive timestamps - and I don't believe we are doing that.
   
   ```python
   import datetime

   spark.conf.set("spark.sql.session.timeZone", "UTC")
   df = spark.createDataFrame([
       (datetime.datetime(1990, 8, 10, 0, 0, tzinfo=datetime.timezone.utc),),  # aware
       (datetime.datetime(1990, 8, 10, 0, 0),),                                # naive
   ], ["ts"])
   df.show()
   ```
   
   The two rows above come out different - because we do not respect the session time zone when converting the naive one. Notice that no UDF is involved at this point.
   
   
https://github.com/apache/spark/blob/0908ec5272e7b790761f64b46553c4567ad96d01/python/pyspark/sql/types.py#L441
   
   We don't check the time zone info in `toInternal` and `fromInternal`. (I don't know if there are other tricks involved, like changing the system time zone, but the results do come out different.)
   
   We can fix that with some hacks if we really want to - that's option 1 that I mentioned above. Again, we need to do it *everywhere*. A rough sketch of the idea is below.
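   
   A minimal sketch of what a session-time-zone-aware conversion could look like (this is not the actual PySpark code; `to_internal_micros` and `session_tz` are hypothetical names, and naive inputs are assumed to be in the session time zone):
   
   ```python
   import datetime
   from zoneinfo import ZoneInfo  # Python 3.9+

   EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)

   def to_internal_micros(dt: datetime.datetime, session_tz: str = "UTC") -> int:
       """Hypothetical: convert a datetime to microseconds since the UTC epoch,
       interpreting naive datetimes in the session time zone rather than the
       system time zone."""
       if dt.tzinfo is None:
           dt = dt.replace(tzinfo=ZoneInfo(session_tz))
       # integer arithmetic avoids the precision loss of float timestamp()
       return (dt - EPOCH) // datetime.timedelta(microseconds=1)
   ```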
   
   However, that is not the full picture. We have an even worse issue with `datetime.datetime` - yes, internally we can convert it to an epoch timestamp, but the user might want to play with it in Python.
   
   ```python
   import datetime

   from pyspark.sql.functions import udf
   from pyspark.sql.types import BooleanType

   @udf(returnType=BooleanType())
   def greater(ts):
       return ts > datetime.datetime(1990, 8, 10, 0, 0, tzinfo=datetime.timezone.utc)
   ```
   
   The code above will raise an error, because we convert the `TimestampType` value to a naive `datetime` - even though we claim that `TimestampType` is time zone aware. It's illegal to compare a naive timestamp with an aware one in Python (an `==` check is allowed, but it always returns `False`).
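   
   For reference, this is plain Python behavior, independent of Spark:
   
   ```python
   import datetime

   naive = datetime.datetime(1990, 8, 10, 0, 0)
   aware = datetime.datetime(1990, 8, 10, 0, 0, tzinfo=datetime.timezone.utc)

   naive == aware  # False: mixed naive/aware equality is always False
   naive > aware   # TypeError: can't compare offset-naive and offset-aware datetimes
   ```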
   
   Also, I found an issue that is probably DST-related.
   
   ```python
   import datetime

   from pyspark.sql.functions import udf
   from pyspark.sql.types import BooleanType

   @udf(returnType=BooleanType())
   def same(ts):
       return ts == datetime.datetime(1990, 8, 10, 0, 0)

   df = spark.createDataFrame([
       (datetime.datetime(1990, 8, 10, 0, 0),)
   ], ["ts"])

   df.select(same("ts")).show()
   ```
   
   Even this returns `False` - there is a one-hour difference, probably due to some missing DST handling.
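   
   To illustrate the kind of shift a DST-observing time zone can introduce (a hedged, Spark-free example; `Europe/London`, which was on UTC+1 on that date, is just a sample zone, not necessarily the one involved here):
   
   ```python
   import datetime
   from zoneinfo import ZoneInfo  # Python 3.9+

   wall = datetime.datetime(1990, 8, 10, 0, 0)
   as_utc = wall.replace(tzinfo=datetime.timezone.utc)
   as_london = wall.replace(tzinfo=ZoneInfo("Europe/London"))  # BST, UTC+1 in August

   print(as_utc - as_london)  # 1:00:00 - same wall clock, an hour apart in real time
   ```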
   
   Back to my point - our `TimestampType` in Python is just broken - it will disappoint users when they try to do any manipulation with it. We can't mix naive and aware timestamps together, because Python does not support that.
   
   This is why I propose my second option - it's a bit aggressive, but we can make it right - always map `TimestampType` to an aware `datetime` and `TimestampNTZType` to a naive `datetime`. I believe that's the only way we can make it completely correct. A sketch of the deserialization side is below.
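   
   A minimal sketch of option 2, assuming hypothetical helper names (the real change would have to touch every conversion path):
   
   ```python
   import datetime
   from zoneinfo import ZoneInfo  # Python 3.9+

   EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)

   def from_internal_aware(micros: int, session_tz: str = "UTC") -> datetime.datetime:
       """Hypothetical: map internal UTC microseconds to an *aware* datetime
       rendered in the session time zone (for TimestampType)."""
       return (EPOCH + datetime.timedelta(microseconds=micros)).astimezone(ZoneInfo(session_tz))

   def from_internal_naive(micros: int) -> datetime.datetime:
       """Hypothetical: map internal microseconds to a *naive* datetime,
       with no time zone attached (for TimestampNTZType)."""
       return EPOCH.replace(tzinfo=None) + datetime.timedelta(microseconds=micros)
   ```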
   
   However, there is a risk that some existing user code would break. If that's a concern, we can just leave this broken - it still works on some occasions.

