zilto opened a new issue, #47043:
URL: https://github.com/apache/arrow/issues/47043
### Describe the bug, including details regarding any error messages, version, and platform.
## Summary
A valid `pyarrow.TimestampScalar` value raises an error when its
`.__repr__()` is called. This happens only when the timezone is specified as
an offset such as `+07:30`. It doesn't occur with IANA names such as
`"America/New_York"` or with UTC-like names such as `"UTC"`.
```python
from datetime import datetime
import pyarrow
py_value = datetime(2012, 1, 1)
no_tz = pyarrow.scalar(py_value, type=pyarrow.timestamp("s"))
no_tz.__str__() # '2012-01-01 00:00:00'
no_tz.__repr__() # "<pyarrow.TimestampScalar: '2012-01-01T00:00:00'>"
utc_tz = pyarrow.scalar(py_value, type=pyarrow.timestamp("s", tz="UTC"))
utc_tz.__str__() # '2012-01-01 00:00:00'
utc_tz.__repr__() # "<pyarrow.TimestampScalar: '2012-01-01T00:00:00'>"
offset_tz = pyarrow.scalar(py_value, type=pyarrow.timestamp("s", tz="+00:00"))
offset_tz.__str__() # '2012-01-01 00:00:00'
offset_tz.__repr__()
# pyarrow.lib.ArrowInvalid: Cannot locate timezone '+00:00': +00:00 not found in timezone database
```
Note: I have limited understanding of `cdef`, `.pyx`, `.pxi`, etc.
## Resources
Useful links and resources from my investigation:
- [cdef of
TimestampScalar](https://github.com/apache/arrow/blob/d2a171805c63caa27f05232695b753e07c32cb1d/python/pyarrow/scalar.pxi#L768)
overrides the `Scalar.__repr__()`. It uses `pyarrow.compute.strftime()`
- `pyarrow.compute.strftime` is [delegated to the
kernel](https://github.com/apache/arrow/blob/d2a171805c63caa27f05232695b753e07c32cb1d/cpp/src/arrow/compute/api_scalar.cc#L926)
- The [kernel definition of
strftime](https://github.com/apache/arrow/blob/d2a171805c63caa27f05232695b753e07c32cb1d/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc#L1166)
- The [source of the raised
error](https://github.com/apache/arrow/blob/d2a171805c63caa27f05232695b753e07c32cb1d/cpp/src/arrow/compute/kernels/temporal_internal.h#L51)
- Asking the docs chatbot `What are the valid ways to specify the timezone
of a timestamp value?` suggests that IANA names, UTC, and offsets are valid
options. They are the supported conventions in C++ `TimestampType` and Python
`pyarrow.timestamp`
## Background
- This bug was initially found via `pyarrow.csv.CSVWriter().write()` in the
Python library dlt.
- I initially thought it was related to the Python library ConnectorX, which
recently changed its Rust backend from `arrow2` to `arrow`
([issue](https://github.com/sfu-db/connector-x/issues/811))
- Pinned the bug down to ConnectorX's change from specifying the timezone via
`"UTC"` to `"+00:00"` ([why the change was
made](https://github.com/sfu-db/connector-x/issues/735), [the
change](https://github.com/sfu-db/connector-x/pull/743))
- Collaborating with the ConnectorX maintainers, we noticed that
`TimestampScalar.__str__()` and `TimestampScalar.__repr__()` behave differently.

With this new knowledge, I will have to go back to the dlt code for debugging. I
suspect that
[pyarrow.csv.CSVWriter().write()](https://github.com/apache/arrow/blob/d2a171805c63caa27f05232695b753e07c32cb1d/python/pyarrow/_csv.pyx#L1513)
uses string formatting for dumping values to file.
### Component(s)
Python