lllangWV opened a new issue, #44423: URL: https://github.com/apache/arrow/issues/44423
### Describe the enhancement requested

Hey,

I'm opening this issue to suggest a possible performance improvement for the `pa.Table.from_pylist` method. Below is a script where I create a table from a list of dictionaries in two ways:

1. Using `pa.Table.from_pylist`.
2. Initializing the data with `pyarrow.array`, creating the schema with `pa.schema(incoming_array.type)`, and then using `pa.Table.from_arrays`.

I ran a comparison using 100,000 rows, and the second method showed a ~4x speedup. I believe this speedup will scale with larger datasets.

The performance difference seems to occur because the `pa.Table.from_pylist` method, as implemented in [table.pxi](https://github.com/apache/arrow/blob/main/python/pyarrow/table.pxi), processes the list in Python. In contrast, by initializing the data as arrays, the processing is handled in Cython, leading to better performance.

To verify, I sorted the column names in both tables, as there appears to be some sorting happening under the hood in the `pa.libArray` implementation. After sorting, the tables are identical.
### Results:

```bash
Time to load using pa.Table.from_pylist: 8.03 seconds
Time to load using array initialization and pa.Table.from_arrays: 1.82 seconds
Are the tables equal: True
```

### Code:

```python
import random
import pyarrow as pa
import time


def generate_data(n_rows=100, n_columns=100):
    data = []
    for _ in range(n_rows):
        data.append({f'col_{i}': random.randint(0, 100000) for i in range(n_columns)})
    return data


# Generate data
data = generate_data(n_rows=100000, n_columns=100)

# First method
start_time = time.time()
table1 = pa.Table.from_pylist(data)
sorted_names = sorted(table1.column_names)
table1 = table1.select(sorted_names)
print(f"Time to load using pa.Table.from_pylist: {time.time() - start_time} seconds")

# Second method
start_time = time.time()
array = pa.array(data)
schema = pa.schema(array.type)
table2 = pa.Table.from_arrays(array.flatten(), schema=schema)
sorted_names = sorted(table2.column_names)
table2 = table2.select(sorted_names)
print(f"Time to load using array initialization and pa.Table.from_arrays: {time.time() - start_time} seconds")

# Compare tables
print(f"Are the tables equal: {table1.equals(table2)}")
```

Let me know if you need further details!

Best regards,
Logan Lang

### Component(s)

Python
