[GitHub] [arrow] lidavidm commented on issue #10899: Does feather file identify with pyarrow.memory_map(file)?

GitBox Mon, 09 Aug 2021 07:56:58 -0700


lidavidm commented on issue #10899:
URL: https://github.com/apache/arrow/issues/10899#issuecomment-895294381



   The Feather (V2) file format, also known as the Arrow IPC file format, is neither CSV nor Parquet; it is Arrow's own format for data on disk. See this [FAQ entry](https://arrow.apache.org/faq/#what-is-the-difference-between-apache-arrow-and-apache-parquet) as well as the one immediately after it.
   
   The article you linked describes using uncompressed Feather/Arrow IPC files. With an uncompressed file, the layout of the data on disk is the same as its layout in memory, as defined by the Arrow specification, so you can memory-map the file and use it as-is. Of course, you can still memory-map a Parquet or CSV file, but you will have to decode the data first, which carries overhead. (This may still be manageable, e.g. you could decode and process one row group of a Parquet file at a time, but you won't gain the 'zero copy' benefits.)
   
   For analysis of the data, it depends. PyArrow has some [compute functions](https://arrow.apache.org/docs/python/api/compute.html) available, and they may be sufficient for your needs. Otherwise, you can convert to pandas or some other format as needed, but of course this increases your memory usage.


