[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

2020-07-06 Thread GitBox


jorisvandenbossche commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r450201426



##
File path: cpp/src/arrow/python/datetime.cc
##
@@ -262,6 +302,42 @@ int64_t PyDate_to_days(PyDateTime_Date* pydate) {
 PyDateTime_GET_DAY(pydate));
 }
 
+// GIL must be held when calling this function.
+Status StringToTzinfo(const std::string& tz, PyObject** tzinfo) {

Review comment:
   Looks fine!





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

2020-07-03 Thread GitBox


jorisvandenbossche commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r449479330



##
File path: cpp/src/arrow/python/datetime.cc
##
@@ -262,6 +302,42 @@ int64_t PyDate_to_days(PyDateTime_Date* pydate) {
 PyDateTime_GET_DAY(pydate));
 }
 
+// GIL must be held when calling this function.
+Status StringToTzinfo(const std::string& tz, PyObject** tzinfo) {

Review comment:
   It might go a bit too far (and can easily get outdated), but given that 
the below is basically python code written in C++, for me it would help in 
understanding it by including the short python snippet that it represents as a 
comment, eg
   
   ```
   def string_to_tzinfo(tz):
   import pytz
   
   match = ... 
   if match:
   sign, hours, minutes = ...
   return pytz.FixedOffset(sign * (hours * 60 + minutes))
   else:
   return pytz.timezone(tz)
   ```

##
File path: cpp/src/arrow/python/datetime.cc
##
@@ -14,22 +14,62 @@
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.
+#include "arrow/python/datetime.h"
 
 #include 
 #include 
 #include 
 
 #include "arrow/python/common.h"
-#include "arrow/python/datetime.h"
+#include "arrow/python/helpers.h"
 #include "arrow/python/platform.h"
 #include "arrow/status.h"
 #include "arrow/type.h"
 #include "arrow/util/logging.h"
+#include "arrow/util/value_parsing.h"
 
 namespace arrow {
 namespace py {
 namespace internal {
 
+namespace {
+
+bool MatchFixedOffset(const std::string& tz, util::string_view* sign,
+  util::string_view* hour, util::string_view* minute) {
+  if (tz.size() < 5) {

Review comment:
   Maybe add a comment here that this can be replaced with a simpler regex 
once we can use std::regex? (and eg pointing to this PR to see the regex)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7604: ARROW-9223: [Python] Propagate timezone information in pandas conversion

2020-07-02 Thread GitBox


jorisvandenbossche commented on a change in pull request #7604:
URL: https://github.com/apache/arrow/pull/7604#discussion_r449109405



##
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##
@@ -642,24 +641,27 @@ inline Status ConvertStruct(const PandasOptions& options, 
const ChunkedArray& da
   std::vector fields_data(num_fields);
   OwnedRef dict_item;
 
-  // XXX(wesm): In ARROW-7723, we found as a result of ARROW-3789 that second
+  // In ARROW-7723, we found as a result of ARROW-3789 that second
   // through microsecond resolution tz-aware timestamps were being promoted to
   // use the DATETIME_NANO_TZ conversion path, yielding a datetime64[ns] NumPy
   // array in this function. PyArray_GETITEM returns datetime.datetime for
   // units second through microsecond but PyLong for nanosecond (because
-  // datetime.datetime does not support nanoseconds). We inserted this hack to
-  // preserve the <= 0.15.1 behavior until a better solution can be devised
+  // datetime.datetime does not support nanoseconds).
+  // We force the object conversion to preserve the value of the timezone.
   PandasOptions modified_options = options;
-  modified_options.ignore_timezone = true;
   modified_options.coerce_temporal_nanoseconds = false;
 
   for (int c = 0; c < data.num_chunks(); c++) {
 auto arr = checked_cast(data.chunk(c).get());
 // Convert the struct arrays first
 for (int32_t i = 0; i < num_fields; i++) {
+  // Se notes above about conversion.
+  std::shared_ptr field = arr->field(static_cast(i));
+  modified_options.timestamp_as_object =
+  field->type()->id() == Type::TIMESTAMP &&
+  !checked_cast(*field->type()).timezone().empty();

Review comment:
   I am a bit confused here.
   
   > timezone naive datetimes reflect system timezone
   
   Who does this? Does arrow do this in conversion? Or do you mean that 
`PyArray_GETITEM` does this? (AFAIK numpy used to do things with system time 
zone, but that should be solved nowadays)
   
   > Our timestamp -> datetime conversion assumes UTC as far as I can tell 
(like I said not sure what pandas does here).
   
   Where do we assume this? (and this is about naive or aware timestamps?)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org