[GitHub] [arrow] westonpace commented on issue #10138: feather read a part of columns slower than read the entire file

GitBox Tue, 27 Apr 2021 18:02:40 -0700


westonpace commented on issue #10138:
URL: https://github.com/apache/arrow/issues/10138#issuecomment-827996049



   > Any document related to on-disk-storage of feather format?
   
   The feather format is more generally referred to now as the "Arrow IPC File 
Format" (not to be confused with Arrow IPC Streaming Format) and is documented 
[here](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format).
   
   > I've found that read several columns of a feather DataFrame is slower than 
read the entire file
   
   Currently, if you only select some columns, the Arrow IPC reader will still 
read in the entire table from disk.  The feather/IPC format does not currently 
implement partial-record-batch reads of any kind.
   
   Out of curiosity which OS are you running?
   
   When I run your reproduction script I do see results similar to what you 
have posted.  However, I believe this is an artifact of the reproduction 
script.  Generating new files each time does not emulate a clean cache.  
Instead, when you reach your benchmark's read phase, the freshly written data 
is still sitting in the OS page cache as dirty pages, so the OS serves the 
reads from memory rather than from disk.
   
   Just to reinforce this, you are getting read speeds of 1 GB/s (which includes 
decoding time).  Unless you have a very fast SSD, that seems unlikely for reads 
that actually hit the disk.
   
   If you are intending a hot-cache test then you should call `os.sync()` after 
each of the writes (or, more easily, just write the files once in setup).  If 
you are intending a cold-cache test then you should call `os.sync()` AND 
something like...
   ```
           # Linux only; needs root.  Writing "1" drops the clean page cache.
           with open('/proc/sys/vm/drop_caches', 'w') as f:
               f.write("1\n")
   ```
   ...after each write.  When I run a hot-cache test I see consistently faster 
times the fewer columns I am loading.  When I run a cold-cache test I see so 
much noise from the actual disk I/O that it is very hard to see the benefit of 
column selection (although I presume it is still there).
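
   As a rough sketch of the hot-cache and cold-cache setups described above 
(not the script from the issue), something like the following could be used.  
The helper names `flush_dirty_pages`, `drop_page_cache` and `time_read` are 
made up for illustration, and the sketch assumes a Unix-like OS where 
`os.sync` is available; dropping the cache additionally assumes Linux and root.

```python
import os
import tempfile
import time


def flush_dirty_pages():
    """Ask the OS to write dirty pages to disk (the data stays cached)."""
    os.sync()


def drop_page_cache():
    """Try to evict clean pages from the Linux page cache.

    Returns True on success, False if the write is not permitted or the
    /proc entry does not exist (non-Linux, container, not root).
    """
    flush_dirty_pages()  # dirty pages cannot be dropped, so sync first
    try:
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("1\n")
        return True
    except OSError:
        return False


def time_read(read_fn, cold=False, repeats=3):
    """Return the best wall-clock time of read_fn() over `repeats` runs.

    With cold=True the page cache is dropped before every run, and a
    RuntimeError is raised if that is not possible, so a hot-cache number
    is never silently reported as a cold-cache one.
    """
    timings = []
    for _ in range(repeats):
        if cold and not drop_page_cache():
            raise RuntimeError("could not drop the page cache (need root on Linux)")
        start = time.perf_counter()
        read_fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    # Setup: write the file once, then sync, instead of rewriting per run.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, os.urandom(1 << 20))
        os.close(fd)
        flush_dirty_pages()

        def read_all():
            with open(path, "rb") as f:
                return f.read()

        print("hot-cache best time (s):", time_read(read_all))
    finally:
        os.remove(path)
```

   To compare column selections, `read_fn` could be a small wrapper such as 
`lambda: pyarrow.feather.read_table(path, columns=[...])`, with the Feather 
file written once during setup rather than before every measurement.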


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
