icexelloss opened a new issue, #34210:
URL: https://github.com/apache/arrow/issues/34210
### Describe the bug, including details regarding any error messages, version, and platform.
I am testing the performance of casting data types on a pyarrow Table and saw
some unexpected results.
In short, casting a column from "tz-naive" to "tz-utc" is much slower than
casting from "tz-naive" to "int64" (about 259 ms vs. 231 µs wall time on 100
million rows). This is unexpected because I think both of these should be
metadata-only changes.
Here is a partial repro:
```
In [1]: import numpy as np; import pandas as pd; import pyarrow as pa
In [5]: df = pd.DataFrame({'time': np.arange(100 * 1000 * 1000)})
In [6]: table = pa.Table.from_pandas(df)
In [8]: schema_naive = pa.schema([pa.field('time' , pa.timestamp('ns'))])
In [9]: schema_tz = pa.schema([pa.field('time' , pa.timestamp('ns', tz='UTC'))])
In [10]: table = table.cast(schema_naive)
In [14]: schema_int = pa.schema([pa.field('time' , pa.int64())])
In [16]: %time table.cast(schema_int)
CPU times: user 114 µs, sys: 30 µs, total: 144 µs
Wall time: 231 µs
Out[16]:
pyarrow.Table
time: int64
----
time: [[0,1,2,3,4,...,99999995,99999996,99999997,99999998,99999999]]
In [17]: %time table.cast(schema_tz)
CPU times: user 119 ms, sys: 140 ms, total: 260 ms
Wall time: 259 ms
Out[17]:
pyarrow.Table
time: timestamp[ns, tz=UTC]
----
time: [[1970-01-01 00:00:00.000000000,1970-01-01
00:00:00.000000001,1970-01-01 00:00:00.000000002,1970-01-01
00:00:00.000000003,1970-01-01 00:00:00.000000004,...,1970-01-01
00:00:00.099999995,1970-01-01 00:00:00.099999996,1970-01-01
00:00:00.099999997,1970-01-01 00:00:00.099999998,1970-01-01 00:00:00.099999999]]
In [18]: table
Out[18]:
pyarrow.Table
time: timestamp[ns]
----
time: [[1970-01-01 00:00:00.000000000,1970-01-01
00:00:00.000000001,1970-01-01 00:00:00.000000002,1970-01-01
00:00:00.000000003,1970-01-01 00:00:00.000000004,...,1970-01-01
00:00:00.099999995,1970-01-01 00:00:00.099999996,1970-01-01
00:00:00.099999997,1970-01-01 00:00:00.099999998,1970-01-01 00:00:00.099999999]]
In [19]: pa.__version__
Out[19]: '11.0.0'
```
### Component(s)
Python