This is an automated email from the ASF dual-hosted git repository.
AlenkaF pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new ea8cef531c GH-49875: [Python] Fix timezone dropped when converting
tz-aware Categorical to Arrow array (#49878)
ea8cef531c is described below
commit ea8cef531c0340fd1a92f9cca6a61634af33d806
Author: AnkitAhlawat <[email protected]>
AuthorDate: Wed May 6 13:20:58 2026 +0530
GH-49875: [Python] Fix timezone dropped when converting tz-aware
Categorical to Arrow array (#49878)
### Rationale for this change
When converting a pandas.Categorical with tz-aware datetime categories to a
PyArrow array, the timezone information was silently dropped from the
dictionary array's value type. This is a silent data loss bug — no warning or
error is raised, but the timezone metadata is lost.
### What changes are included in this PR?
In `python/pyarrow/array.pxi`, the Categorical conversion was using
`values.categories.values` (a raw numpy array), which strips timezone metadata
since numpy does not support tz-aware datetimes. Changed to `values.categories`
(a pandas Index) and added `from_pandas=True` so PyArrow uses the pandas
conversion path, which correctly preserves timezone metadata.
### Are these changes tested?
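Yes. Verified manually, and a regression test
(`test_categorical_with_timezone`) is added in
`python/pyarrow/tests/test_pandas.py`.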
### Are there any user-facing changes?
Yes. This is a bug fix (see #49875).
This PR contains a **"Critical Fix"** — timezone information was lost
silently during conversion without any warning or error.
* GitHub Issue: #49875
Authored-by: [email protected] <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
---
python/pyarrow/array.pxi | 7 ++++---
python/pyarrow/tests/test_pandas.py | 9 +++++++++
2 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/python/pyarrow/array.pxi b/python/pyarrow/array.pxi
index b7f3a46f9e..ecdbb342d3 100644
--- a/python/pyarrow/array.pxi
+++ b/python/pyarrow/array.pxi
@@ -356,8 +356,8 @@ def array(object obj, type=None, mask=None, size=None, from_pandas=None,
values.codes, mask, index_type, memory_pool)
try:
dictionary = array(
- values.categories.values, type=value_type,
- memory_pool=memory_pool)
+ values.categories, type=value_type,
+ from_pandas=True, memory_pool=memory_pool)
except TypeError:
# TODO when removing the deprecation warning, this whole
# try/except can be removed (to bubble the TypeError of
@@ -371,7 +371,8 @@ def array(object obj, type=None, mask=None, size=None, from_pandas=None,
"TypeError",
FutureWarning, stacklevel=2)
dictionary = array(
- values.categories.values, memory_pool=memory_pool)
+ values.categories, from_pandas=True,
+ memory_pool=memory_pool)
else:
raise
diff --git a/python/pyarrow/tests/test_pandas.py b/python/pyarrow/tests/test_pandas.py
index 0339975f45..063532140c 100644
--- a/python/pyarrow/tests/test_pandas.py
+++ b/python/pyarrow/tests/test_pandas.py
@@ -3047,6 +3047,15 @@ class TestConvertMisc:
df['a'] = df['a'].astype('category')
_check_pandas_roundtrip(df)
+ def test_categorical_with_timezone(self):
+ # GH-49875: timezone was dropped when converting tz-aware categorical
+ cats = pd.DatetimeIndex(["2024-01-01", "2024-01-02"]).tz_localize("US/Eastern")
+ cat = pd.Categorical(values=[cats[0], cats[1], cats[0]], categories=cats)
+
+ arr = pa.array(cat, from_pandas=True)
+
+ assert arr.type.value_type.tz == "US/Eastern"
+
def test_empty_arrays(self):
for dtype_str, pa_type in self.type_pairs:
if (Version(pd.__version__) >= Version("3.0.0") and