lidavidm commented on issue #10899: URL: https://github.com/apache/arrow/issues/10899#issuecomment-895294381
The Feather (V2) file format, also known as the Arrow IPC file format, is neither CSV nor Parquet, but rather Arrow's own format for data on disk. See this [FAQ entry](https://arrow.apache.org/faq/#what-is-the-difference-between-apache-arrow-and-apache-parquet) as well as the one immediately after it.

The article you linked describes using uncompressed Feather/Arrow IPC files. That matters because, per the Arrow specification, the layout of uncompressed data on disk is the same as its layout in memory, so you can memory-map the file and use it as-is. Of course, you can still memory-map a Parquet or CSV file, but you will have to decode the data first, which carries overhead. (This may still be manageable, e.g. you could decode and process one row group of a Parquet file at a time, but you won't gain the 'zero copy' benefits.)

For analysis of the data, it depends. PyArrow has some [compute functions](https://arrow.apache.org/docs/python/api/compute.html) available, and they may be sufficient for your needs. Otherwise you can convert to pandas or some other format as needed, but of course this increases your memory usage.
