zilto opened a new issue, #47043: URL: https://github.com/apache/arrow/issues/47043
### Describe the bug, including details regarding any error messages, version, and platform.

## Summary

A valid `pyarrow.TimestampScalar` value throws an error when its `.__repr__()` is called. This happens only when the timezone is specified as an offset such as `+07:30`. It doesn't occur with IANA names such as `"America/New_York"` or with UTC-like values such as `"UTC"`.

```python
from datetime import datetime

import pyarrow

py_value = datetime(2012, 1, 1)

# No timezone: both __str__ and __repr__ work.
no_tz = pyarrow.scalar(py_value, type=pyarrow.timestamp("s"))
no_tz.__str__()  # '2012-01-01 00:00:00'
no_tz.__repr__()  # "<pyarrow.TimestampScalar: '2012-01-01T00:00:00'>"

# Named timezone ("UTC"): both work.
utc_tz = pyarrow.scalar(py_value, type=pyarrow.timestamp("s", tz="UTC"))
utc_tz.__str__()  # '2012-01-01 00:00:00'
utc_tz.__repr__()  # "<pyarrow.TimestampScalar: '2012-01-01T00:00:00'>"

# Offset timezone ("+00:00"): __str__ works, __repr__ raises.
offset_tz = pyarrow.scalar(py_value, type=pyarrow.timestamp("s", tz="+00:00"))
offset_tz.__str__()  # '2012-01-01 00:00:00'
offset_tz.__repr__()  # pyarrow.lib.ArrowInvalid: Cannot locate timezone '+00:00': +00:00 not found in timezone database
```

Note: I have limited understanding of `cdef`, `.pyx`, `.pxi`, etc.

## Resources

Useful links and resources from my investigation:

- The [cdef of TimestampScalar](https://github.com/apache/arrow/blob/d2a171805c63caa27f05232695b753e07c32cb1d/python/pyarrow/scalar.pxi#L768) overrides `Scalar.__repr__()`. It uses `pyarrow.compute.strftime()`
- `pyarrow.compute.strftime` is [delegated to the kernel](https://github.com/apache/arrow/blob/d2a171805c63caa27f05232695b753e07c32cb1d/cpp/src/arrow/compute/api_scalar.cc#L926)
- The [kernel definition of strftime](https://github.com/apache/arrow/blob/d2a171805c63caa27f05232695b753e07c32cb1d/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc#L1166)
- The [source of the raised error](https://github.com/apache/arrow/blob/d2a171805c63caa27f05232695b753e07c32cb1d/cpp/src/arrow/compute/kernels/temporal_internal.h#L51)
- Asking the docs chatbot `What are the valid ways to specify the timezone of a timestamp value?` suggests that IANA names, UTC, and offsets are all valid options. They are the supported conventions in the C++ `TimestampType` and the Python `pyarrow.timestamp`

## Background

- This bug was initially found via `pyarrow.csv.CSVWriter().write()` in the Python library dlt.
- I initially thought it was related to the Python library ConnectorX, which recently changed its Rust backend from `arrow2` to `arrow` ([issue](https://github.com/sfu-db/connector-x/issues/811))
- I pinned the bug down to ConnectorX's change from specifying the timezone as `"UTC"` to `"+00:00"` ([why the change was made](https://github.com/sfu-db/connector-x/issues/735), [the change](https://github.com/sfu-db/connector-x/pull/743))
- Collaborating with the ConnectorX maintainers, we noticed that `TimestampScalar.__str__()` and `TimestampScalar.__repr__()` behave differently

With this new knowledge, I will have to go back to the dlt code for debugging. I suspect that [pyarrow.csv.CSVWriter().write()](https://github.com/apache/arrow/blob/d2a171805c63caa27f05232695b753e07c32cb1d/python/pyarrow/_csv.pyx#L1513) uses string formatting when dumping values to file.

### Component(s)

Python

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
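
For illustration only: the issue above contrasts offset timezones such as `+07:30` with names that are looked up in a timezone database. The stdlib-only sketch below shows that a fixed offset can be formatted without any database lookup. `parse_fixed_offset` and `format_utc_instant` are hypothetical helper names invented for this example; they are not pyarrow APIs, and this is not a description of how pyarrow formats timestamps. The sketch also assumes the stored value is a naive UTC instant.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches strings like "+07:30" or "-05:00" (assumed offset syntax for this sketch).
_OFFSET_RE = re.compile(r"^([+-])(\d{2}):(\d{2})$")


def parse_fixed_offset(tz: str) -> timezone:
    """Hypothetical helper: turn an offset string into a datetime.timezone."""
    match = _OFFSET_RE.match(tz)
    if match is None:
        raise ValueError(f"not a fixed-offset timezone string: {tz!r}")
    sign, hours, minutes = match.groups()
    delta = timedelta(hours=int(hours), minutes=int(minutes))
    return timezone(-delta if sign == "-" else delta)


def format_utc_instant(utc_naive: datetime, tz: str) -> str:
    """Hypothetical helper: render a naive UTC datetime in a fixed offset."""
    aware = utc_naive.replace(tzinfo=timezone.utc)
    local = aware.astimezone(parse_fixed_offset(tz))
    return local.strftime("%Y-%m-%dT%H:%M:%S%z")


if __name__ == "__main__":
    print(format_utc_instant(datetime(2012, 1, 1), "+00:00"))
    print(format_utc_instant(datetime(2012, 1, 1), "+07:30"))
```

Named zones such as `"UTC"` or `"America/New_York"` are deliberately rejected by this sketch with a `ValueError`, since resolving them does require a timezone database.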