[I] Improving the pa.Table.from_pylist method [arrow]

via GitHub Tue, 15 Oct 2024 11:34:54 -0700


lllangWV opened a new issue, #44423:
URL: https://github.com/apache/arrow/issues/44423


   ### Describe the enhancement requested
   
   Hey,
   
   I'm opening this issue to suggest a possible performance improvement for the 
`pa.Table.from_pylist` method. Below is a script where I create a table from a 
list of dictionaries in two ways:
   
   1. Using `pa.Table.from_pylist`.
   2. Initializing the data with `pyarrow.array`, creating the schema with 
`pa.schema(incoming_array.type)`, and then using `pa.Table.from_arrays`.
   
   I ran a comparison using 100,000 rows and 100 columns, and the second method
   was about 4x faster (8.03 s vs. 1.82 s). I expect this speedup to hold for
   larger datasets. The performance difference seems to occur because the
   `pa.Table.from_pylist` method, as implemented in
   [table.pxi](https://github.com/apache/arrow/blob/main/python/pyarrow/table.pxi),
   processes the list of dictionaries in Python. In contrast, when the data is
   initialized with `pa.array`, the conversion is handled in compiled code
   (Cython/C++), leading to better performance.
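
To make the cost of Python-level processing concrete, here is a minimal sketch of a row-to-column conversion written in pure Python. `transpose_rows` is a hypothetical helper for illustration only; it is not the code in table.pxi, and taking the column names from the first row is an assumption made to keep the example short.

```python
import random
import time


def transpose_rows(rows):
    """Hypothetical stand-in for a Python-level row-to-column conversion."""
    # Assumption: the first row defines the column names.
    names = list(rows[0].keys()) if rows else []
    # One full pass over all rows per column, executed by the interpreter.
    return {name: [row.get(name) for row in rows] for name in names}


# Small synthetic input in the same shape as the benchmark data.
rows = [
    {f"col_{i}": random.randint(0, 100000) for i in range(100)}
    for _ in range(1000)
]

start = time.perf_counter()
columns = transpose_rows(rows)
elapsed = time.perf_counter() - start

print(f"Transposed {len(rows)} rows into {len(columns)} columns in {elapsed:.4f} seconds")
```

The work grows with rows times columns, and every dictionary lookup goes through the interpreter, which is the kind of overhead I suspect a compiled conversion avoids.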
   
   To verify that the two methods produce the same result, I sorted the column
   names in both tables, as there appears to be some sorting happening under the
   hood in the `pa.libArray` implementation. After sorting, the tables are
   identical.
   
   ### Results:
   ```text
   Time to load using pa.Table.from_pylist: 8.03 seconds
   Time to load using array initialization and pa.Table.from_arrays: 1.82 seconds
   Are the tables equal: True
   ```
   
   ### Code:
   ```python
   import random
   import pyarrow as pa
   import time
   
   def generate_data(n_rows=100, n_columns=100):
       # Build n_rows dictionaries, each with n_columns random integer values
       return [
           {f'col_{i}': random.randint(0, 100000) for i in range(n_columns)}
           for _ in range(n_rows)
       ]
   
   # Generate data
   data = generate_data(n_rows=100000, n_columns=100)
   
   # First method: build the table directly from the list of dictionaries
   start_time = time.perf_counter()
   table1 = pa.Table.from_pylist(data)
   sorted_names = sorted(table1.column_names)
   table1 = table1.select(sorted_names)
   print(f"Time to load using pa.Table.from_pylist: {time.perf_counter() - start_time:.2f} seconds")
   
   # Second method: convert to a struct array first, then flatten it into columns
   start_time = time.perf_counter()
   array = pa.array(data)
   schema = pa.schema(array.type)
   table2 = pa.Table.from_arrays(array.flatten(), schema=schema)
   sorted_names = sorted(table2.column_names)
   table2 = table2.select(sorted_names)
   print(f"Time to load using array initialization and pa.Table.from_arrays: {time.perf_counter() - start_time:.2f} seconds")
   
   # Compare tables
   print(f"Are the tables equal: {table1.equals(table2)}")
   ```
   
   Let me know if you need further details!
   
   Best regards,  
   Logan Lang
   
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
