westonpace commented on issue #10138: URL: https://github.com/apache/arrow/issues/10138#issuecomment-827996049
> Any document related to on-disk-storage of feather format?

The feather format is more generally referred to now as the "Arrow IPC File Format" (not to be confused with Arrow IPC Streaming Format) and is documented [here](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format).

> I've found that read several columns of a feather DataFrame is slower than read the entire file

Currently, if you only select some columns, the Arrow IPC reader will still read in the entire table from disk. The feather/IPC format does not currently implement partial-record-batch reads of any kind.

Out of curiosity which OS are you running?

When I run your reproduction script I do see results similar to what you have posted. However, I believe this is due to issues with the reproduction script. Generating new files each time will not emulate a clean cache. Instead, when you go to your benchmark's read phase, you will have a bunch of dirty OS pages in the cache (but they will be in the cache and the OS will serve them from the cache).

Just to reinforce this, you are getting read speeds of 1GB/s (which includes decoding time). Unless you have a very fast SSD that seems unlikely.

If you are intending a hot-cache test then you should put `os.sync` after each of the writes (or, more easily, just write the files once in setup).

If you are intending a cold-cache test then you should put `os.sync` AND something like...

```
# with open('/proc/sys/vm/drop_caches', 'w') as f:
#     f.write("1\n")
```

...after each write.

When I run a hot-cache test I see consistently faster times the fewer columns I am loading. When I run a cold-cache test I see so much noise from the actual disk I/O that it is very hard to see the benefit of column selection (although I presume it is still there).
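For reference, the hot-cache and cold-cache setups described in the comment could be wrapped in a small timing helper along these lines. This is only a sketch, not the script used for the measurements mentioned there: the helper names (`flush_dirty_pages`, `drop_page_cache`, `time_read`) are made up for illustration, and dropping the page cache is assumed to run on Linux with root privileges.

```python
import os
import time


def flush_dirty_pages():
    """Ask the OS to write dirty pages to disk (the hot-cache setup step)."""
    if hasattr(os, "sync"):  # os.sync is not available on Windows
        os.sync()


def drop_page_cache(path="/proc/sys/vm/drop_caches"):
    """Try to evict the page cache (the cold-cache setup step).

    Assumes Linux and root privileges. Returns True if the write
    succeeded and False otherwise, so callers can tell whether the
    run was really cold.
    """
    flush_dirty_pages()
    try:
        with open(path, "w") as f:
            f.write("1\n")
        return True
    except OSError:
        return False


def time_read(read_fn, repeats=5, cold=False):
    """Call read_fn() `repeats` times and return the durations in seconds.

    With cold=True the page cache is dropped before every call.
    """
    timings = []
    for _ in range(repeats):
        if cold:
            drop_page_cache()
        start = time.perf_counter()
        read_fn()
        timings.append(time.perf_counter() - start)
    return timings
```

With pyarrow installed, `read_fn` could be something like `lambda: pyarrow.feather.read_table("data.feather", columns=["a", "b"])`, where `data.feather` is a hypothetical file written once during setup rather than regenerated before each measurement.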
