[GitHub] [arrow] westonpace commented on issue #35894: parquet.read_table opens files twice

via GitHub Tue, 20 Jun 2023 10:00:53 -0700


westonpace commented on issue #35894:
URL: https://github.com/apache/arrow/issues/35894#issuecomment-1599178615


   > it seems that was an issue within the Dataset API
   
   Yes, however, pretty much all reads use the dataset API internally now; 
the code paths were merged to simplify maintenance.  `read_table` creates a 
dataset with one file and then reads it.  The file is first opened when the 
dataset is created (to get the file's schema).  I am pretty sure it isn't 
actually reading the schema twice, though.  I think the sequence is something 
like...
   
    * Create dataset
      * open file
      * read metadata
      * close file
    * Read dataset
      * open file
      * read data
      * close file
   
   I agree that it is somewhat less than ideal that the file is opened twice.


