MMCMA opened a new issue, #38260:
URL: https://github.com/apache/arrow/issues/38260
### Describe the bug, including details regarding any error messages, version, and platform.
We experience a massive drop in performance when using pandas 2.1.1 vs.
pandas 1.5.3 when invoking pa.Table.from_pandas().
In this example, the conversion time increased from roughly 2.9 seconds to
16.2 seconds. In our data application the problem is even more dramatic since
the size of the dataframe is larger - it seems very sensitive to the number of
columns. Doubling the number of columns roughly quadruples the compute time
(`num_cols=20000` vs. `num_cols=40000`). Not sure if this should also be raised
with pandas.
import pyarrow as pa
import pandas as pd
import numpy as np
import timeit
num_cols = 20000
num_dates = 8800
dates = pd.date_range(start='19900101', freq='b', periods=num_dates)
data = np.random.randint(low=0, high=10, size=(num_dates, num_cols))
df = pd.DataFrame(data, index=dates)
tic = timeit.default_timer()
pa.Table.from_pandas(df, preserve_index=True)
total_time = timeit.default_timer() - tic
print(f'Conversion from pandas to pyarrow took {total_time} seconds')
### Component(s)
Python