Tej-ashwani commented on issue #35332: URL: https://github.com/apache/arrow/issues/35332#issuecomment-1576497848
The code you provided demonstrates two different approaches to loading a Parquet dataset with PyArrow: one using `pyarrow.dataset.Dataset.to_table()` and the other using `pyarrow.parquet.read_table()`. You're observing that loading the whole dataset with `to_table()` is sometimes slower than with `read_table()`, and you're wondering why.

`pyarrow.dataset.Dataset` provides a high-level interface for working with datasets: it lets you apply filters, handle partitioning, and infer a unified schema before any data is loaded into memory. `pyarrow.parquet.read_table()`, by contrast, reads the Parquet data directly into an Arrow table without those dataset-level steps.

The performance difference you're observing could be attributed to a couple of factors:

1. **Metadata and discovery overhead.** With the dataset API, PyArrow first has to discover the files and load the dataset's metadata, which includes the file schema, partitioning, and other dataset properties. This additional step introduces overhead, especially if the metadata is large or distributed across multiple files.
2. **Lazy evaluation.** `pyarrow.dataset.Dataset` is lazy: creating the dataset builds a logical plan and defers reading until it is actually needed (e.g., when you call `to_table()`). Laziness allows for filtering and more efficient memory usage, but the extra planning and scanning machinery can make a full read slower than `read_table()`, which eagerly reads the entire dataset in one pass.

To confirm whether these factors contribute to the performance difference, you can try the following (see the sketches below):

- Check the size of the metadata associated with the dataset. If it's substantial, loading the metadata could be a significant overhead.
- Time the dataset discovery step on its own, separately from `to_table()`. If discovery alone is noticeably slow relative to `read_table()`, it indicates that metadata loading is a contributing factor.
- Compare the loading times for smaller subsets of the dataset using both approaches. If the performance difference is more prominent with larger datasets, it further suggests that the scanning overhead of the dataset API is causing the discrepancy.

By understanding the specific characteristics of your dataset and the operations performed by `pyarrow.dataset.Dataset`, you can gain insight into the performance differences and optimize your loading strategy accordingly.
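As a starting point, here is a minimal sketch of how you might time the two approaches end to end. The path `data/my_dataset` is a placeholder for your own dataset directory:

```python
import time

import pyarrow.dataset as ds
import pyarrow.parquet as pq

path = "data/my_dataset"  # placeholder: point this at your dataset

# Dataset API: discovery + lazy plan, then materialize with to_table().
start = time.perf_counter()
table_ds = ds.dataset(path, format="parquet").to_table()
print(f"dataset API: {time.perf_counter() - start:.3f}s")

# Direct read: eagerly read everything into an Arrow table.
start = time.perf_counter()
table_pq = pq.read_table(path)
print(f"read_table:  {time.perf_counter() - start:.3f}s")

# Both paths should produce the same data.
assert table_ds.num_rows == table_pq.num_rows
```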
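To separate the discovery/metadata cost from the actual scan, you can time the two phases of the dataset API independently, along the lines of this sketch (again with a placeholder path):

```python
import time

import pyarrow.dataset as ds

path = "data/my_dataset"  # placeholder: point this at your dataset

# Phase 1 -- discovery: file listing, schema inference, partition detection.
start = time.perf_counter()
dataset = ds.dataset(path, format="parquet")
discovery_s = time.perf_counter() - start

# Phase 2 -- scan: actually reading the data into memory.
start = time.perf_counter()
table = dataset.to_table()
scan_s = time.perf_counter() - start

print(f"discovery: {discovery_s:.3f}s, scan: {scan_s:.3f}s")
```

If the discovery phase dominates, metadata loading is your bottleneck; if the scan phase dominates, the overhead lies in the dataset scanner itself.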
