pitrou commented on a change in pull request #7805:
URL: https://github.com/apache/arrow/pull/7805#discussion_r457436813



##########
File path: cpp/src/arrow/python/inference.cc
##########
@@ -332,6 +329,13 @@ class TypeInferrer {
       ++int_count_;
     } else if (PyDateTime_Check(obj)) {
       ++timestamp_micro_count_;
+      OwnedRef tzinfo(PyObject_GetAttrString(obj, "tzinfo"));
+      if (tzinfo.obj() != nullptr && tzinfo.obj() != Py_None && 
timezone_.empty()) {
+        // From public docs on array construction
+        // "Localized timestamps will currently be returned as UTC "
+        //     representation). "

Review comment:
       If I reuse the nomenclature from the test, I get without this PR (it's 
14h21 UTC currently):
   ```python
   >>> now_utc                                                                  
                                   
   datetime.datetime(2020, 7, 20, 14, 21, 42, 96119, tzinfo=<UTC>)
   >>> arr = pa.array([now_utc])                                                
                                   
   >>> arr                                                                      
                                   
   <pyarrow.lib.TimestampArray object at 0x7f44b0bfcc90>
   [
     2020-07-20 14:21:42.096119
   ]
   >>> arr.type.tz                                                              
                                   
   >>> arr.to_pylist()                                                          
                                   
   [datetime.datetime(2020, 7, 20, 14, 21, 42, 96119)]
   >>> arr.to_pandas()                                                          
                                   
   0   2020-07-20 14:21:42.096119
   dtype: datetime64[ns]
   ```
   ```python
   >>> now_with_tz                                                              
                                             
   datetime.datetime(2020, 7, 20, 10, 21, 42, 96119, tzinfo=<DstTzInfo 
'US/Eastern' EDT-1 day, 20:00:00 DST>)
   >>> arr = pa.array([now_with_tz])                                            
                                             
   >>> arr.type.tz                                                              
                                             
   >>> arr.to_pylist()                                                          
                                             
   [datetime.datetime(2020, 7, 20, 10, 21, 42, 96119)]
   >>> arr.to_pylist()[0].tzinfo                                                
                                             
   >>> arr.to_pandas()                                                          
                                             
   0   2020-07-20 10:21:42.096119
   dtype: datetime64[ns]
   ```
   So without the PR, there's a problem for non-UTC timestamps which lose the 
timezone information. It seems to get worse with this PR because non-UTC 
timestamps get tagged as UTC without being corrected for the timezone's offset, 
which is misleading. At least originally you may be alerted by the absence of a 
timezone on the type (in Python terms, it's a "naive" timestamp).
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to