[I] Massive performance deterioration with pandas 2.1.1 vs. 1.5.3 when calling pa.Table.from_pandas() [arrow]

via GitHub Fri, 13 Oct 2023 03:41:02 -0700


MMCMA opened a new issue, #38260:
URL: https://github.com/apache/arrow/issues/38260


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   We experience a massive drop in performance with pandas 2.1.1 compared to 
pandas 1.5.3 when invoking pa.Table.from_pandas().
   In the example below, the conversion time increased from roughly 2.9 seconds to 
16.2 seconds. In our data application the problem is even more dramatic because 
the dataframe is larger; the conversion time seems very sensitive to the number of 
columns. Doubling the number of columns roughly quadruples the compute time 
(`num_cols=20000` vs. `num_cols=40000`). We are not sure whether this should 
also be raised with pandas.
   
       import pyarrow as pa
       import pandas as pd
       import numpy as np
       import timeit
   
       # Wide frame: many integer columns over a business-day index
       num_cols = 20000
       num_dates = 8800
       dates = pd.date_range(start='19900101', freq='B', periods=num_dates)
       data = np.random.randint(low=0, high=10, size=(num_dates, num_cols))
       df = pd.DataFrame(data, index=dates)
   
       tic = timeit.default_timer()
       pa.Table.from_pandas(df, preserve_index=True)
       total_time = timeit.default_timer() - tic
       print(f'Conversion from pandas to pyarrow took {total_time} seconds')
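
The figures quoted above can be turned into two rough numbers: the slowdown factor between the two pandas versions, and the apparent scaling exponent in the number of columns. The sketch below is only arithmetic on the reported values. The helper name `scaling_exponent` is hypothetical, and the absolute times passed to it are placeholders chosen to have the reported 4:1 ratio, not measurements.

```python
import math


def scaling_exponent(n_small, t_small, n_large, t_large):
    """Estimate k in t ~ n**k from two (size, time) measurements."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)


# Reported conversion times: about 2.9 s (pandas 1.5.3) vs. 16.2 s (pandas 2.1.1).
slowdown = 16.2 / 2.9
print(f"slowdown factor: {slowdown:.1f}x")

# "2x columns -> roughly 4x time": placeholder times with a 4:1 ratio.
k = scaling_exponent(20_000, 1.0, 40_000, 4.0)
print(f"estimated exponent: {k:.2f}")
```

An exponent near 2 would suggest that the conversion cost grows roughly quadratically with the number of columns, which fits the observation that wider dataframes are hit much harder.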
   
   ### Component(s)
   
   Python


