Tej-ashwani commented on issue #35332: URL: https://github.com/apache/arrow/issues/35332#issuecomment-1576497848
The code you provided demonstrates two different approaches to loading a Parquet dataset with PyArrow: one using `pyarrow.dataset.Dataset.to_table()` and the other using `pyarrow.parquet.read_table()`. You're observing that loading the whole dataset with `to_table()` is sometimes slower than with `read_table()`, and you're wondering why.

`pyarrow.dataset.Dataset` provides a high-level interface for working with datasets: it lets you apply filters, handle partitioning, and infer a unified schema before any data is loaded into memory. `pyarrow.parquet.read_table()`, by contrast, reads the Parquet data directly into an Arrow table without those dataset-level steps.

The performance difference you're observing could be attributed to a couple of factors:

1. **Metadata and discovery overhead.** With the dataset API, PyArrow first has to discover the files and load the dataset's metadata, which includes the file schema, partitioning, and other dataset properties. This additional step introduces overhead, especially if the metadata is large or distributed across multiple files.
2. **Lazy evaluation.** `pyarrow.dataset.Dataset` is lazy: creating the dataset builds a logical plan and defers reading until it is actually needed (e.g., when you call `to_table()`). Laziness allows for filtering and more efficient memory usage, but the extra planning and scanning machinery can make a full read slower than `read_table()`, which eagerly reads the entire dataset in one pass.

To confirm whether these factors contribute to the performance difference, you can try the following (see the sketches below):

- Check the size of the metadata associated with the dataset. If it's substantial, loading the metadata could be a significant overhead.
- Time the dataset discovery step on its own, separately from `to_table()`. If discovery alone is noticeably slow relative to `read_table()`, it indicates that metadata loading is a contributing factor.
- Compare the loading times for smaller subsets of the dataset using both approaches. If the performance difference is more prominent with larger datasets, it further suggests that the scanning overhead of the dataset API is causing the discrepancy.

By understanding the specific characteristics of your dataset and the operations performed by `pyarrow.dataset.Dataset`, you can gain insight into the performance differences and optimize your loading strategy accordingly.
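As a starting point, here is a minimal sketch of how you might time the two approaches end to end. The path `data/my_dataset` is a placeholder for your own dataset directory:

```python
import time

import pyarrow.dataset as ds
import pyarrow.parquet as pq

path = "data/my_dataset"  # placeholder: point this at your dataset

# Dataset API: discovery + lazy plan, then materialize with to_table().
start = time.perf_counter()
table_ds = ds.dataset(path, format="parquet").to_table()
print(f"dataset API: {time.perf_counter() - start:.3f}s")

# Direct read: eagerly read everything into an Arrow table.
start = time.perf_counter()
table_pq = pq.read_table(path)
print(f"read_table:  {time.perf_counter() - start:.3f}s")

# Both paths should produce the same data.
assert table_ds.num_rows == table_pq.num_rows
```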
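To separate the discovery/metadata cost from the actual scan, you can time the two phases of the dataset API independently, along the lines of this sketch (again with a placeholder path):

```python
import time

import pyarrow.dataset as ds

path = "data/my_dataset"  # placeholder: point this at your dataset

# Phase 1 -- discovery: file listing, schema inference, partition detection.
start = time.perf_counter()
dataset = ds.dataset(path, format="parquet")
discovery_s = time.perf_counter() - start

# Phase 2 -- scan: actually reading the data into memory.
start = time.perf_counter()
table = dataset.to_table()
scan_s = time.perf_counter() - start

print(f"discovery: {discovery_s:.3f}s, scan: {scan_s:.3f}s")
```

If the discovery phase dominates, metadata loading is your bottleneck; if the scan phase dominates, the overhead lies in the dataset scanner itself.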
