zilto opened a new issue, #47043:
URL: https://github.com/apache/arrow/issues/47043

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   ## Summary
   A valid `pyarrow.TimestampScalar` value raises an error when its 
`.__repr__()` is called. This happens only when the timezone is specified as an 
offset such as `+07:30`. It does not occur with IANA names such as 
`"America/New_York"` or with UTC-like names such as `"UTC"`.
   
   ```python
   from datetime import datetime
   import pyarrow
   
   py_value = datetime(2012, 1, 1)
   
   no_tz = pyarrow.scalar(py_value, type=pyarrow.timestamp("s"))
   no_tz.__str__()  # '2012-01-01 00:00:00'
   no_tz.__repr__()  # "<pyarrow.TimestampScalar: '2012-01-01T00:00:00'>"
   
   utc_tz = pyarrow.scalar(py_value, type=pyarrow.timestamp("s", tz="UTC"))
   utc_tz.__str__()  # '2012-01-01 00:00:00'
   utc_tz.__repr__()  # "<pyarrow.TimestampScalar: '2012-01-01T00:00:00'>"
   
   offset_tz = pyarrow.scalar(py_value, type=pyarrow.timestamp("s", 
tz="+00:00"))
   offset_tz.__str__()  # '2012-01-01 00:00:00'
   offset_tz.__repr__()
   # pyarrow.lib.ArrowInvalid: Cannot locate timezone '+00:00': +00:00 not 
found in timezone database
   ```
   
   Note: I have limited understanding of `cdef`, `.pyx`, `.pxi`, etc.
   
   ## Resources
   Useful links and resources from my investigation:
   - The [cdef of 
TimestampScalar](https://github.com/apache/arrow/blob/d2a171805c63caa27f05232695b753e07c32cb1d/python/pyarrow/scalar.pxi#L768)
 overrides `Scalar.__repr__()`. It uses `pyarrow.compute.strftime()`.
   - `pyarrow.compute.strftime` is [delegated to the 
kernel](https://github.com/apache/arrow/blob/d2a171805c63caa27f05232695b753e07c32cb1d/cpp/src/arrow/compute/api_scalar.cc#L926)
   - The [kernel definition of 
strftime](https://github.com/apache/arrow/blob/d2a171805c63caa27f05232695b753e07c32cb1d/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc#L1166)
   - The [source of the raised 
error](https://github.com/apache/arrow/blob/d2a171805c63caa27f05232695b753e07c32cb1d/cpp/src/arrow/compute/kernels/temporal_internal.h#L51)
   - Asking the docs chatbot `What are the valid ways to specify the timezone 
of a timestamp value?` suggests that IANA, UTC, and offsets are valid options. 
They are the supported conventions in C++ `TimestampType` and Python 
`pyarrow.timestamp`
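
To illustrate why an offset such as `+07:30` should not need a timezone database lookup, here is a minimal standard-library sketch. It is not pyarrow's implementation; `parse_offset` is a hypothetical helper name I made up for illustration, and it only accepts the strict `+HH:MM` / `-HH:MM` form.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical helper (not a pyarrow API): accepts only "+HH:MM" or "-HH:MM".
_OFFSET_RE = re.compile(r"^([+-])(\d{2}):(\d{2})$")


def parse_offset(tz: str) -> timezone:
    """Build a fixed-offset tzinfo from an offset string, with no database lookup."""
    match = _OFFSET_RE.match(tz)
    if match is None:
        raise ValueError(f"not a fixed-offset timezone string: {tz!r}")
    sign, hours, minutes = match.groups()
    if int(minutes) >= 60:
        raise ValueError(f"minutes out of range in {tz!r}")
    delta = timedelta(hours=int(hours), minutes=int(minutes))
    # datetime.timezone itself rejects offsets of 24 hours or more.
    return timezone(-delta if sign == "-" else delta)


value = datetime(2012, 1, 1, tzinfo=timezone.utc)
print(value.astimezone(parse_offset("+07:30")).isoformat())
# 2012-01-01T07:30:00+07:30
```

An IANA name such as `"America/New_York"` is rejected by this helper with a `ValueError`, since resolving it does require a timezone database.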
   
   ## Background
   - This bug was initially found via `pyarrow.csv.CSVWriter().write()` in the 
Python library dlt. 
   - I initially thought it was related to the Python library ConnectorX, which 
recently changed its Rust backend from `arrow2` to `arrow` 
([issue](https://github.com/sfu-db/connector-x/issues/811))
   - I pinned down the bug to ConnectorX's change from specifying the timezone 
as `"UTC"` to `"+00:00"` ([why the change was 
made](https://github.com/sfu-db/connector-x/issues/735), [the 
change](https://github.com/sfu-db/connector-x/pull/743))
   - While collaborating with the ConnectorX maintainers, we noticed that 
`TimestampScalar.__str__()` and `TimestampScalar.__repr__()` behave differently
   
   With this new knowledge, I will have to go back to the dlt code for 
debugging. I suspect that 
[pyarrow.csv.CSVWriter().write()](https://github.com/apache/arrow/blob/d2a171805c63caa27f05232695b753e07c32cb1d/python/pyarrow/_csv.pyx#L1513)
 uses string formatting when dumping values to the file.
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org