westonpace commented on issue #14880: URL: https://github.com/apache/arrow/issues/14880#issuecomment-1341728017
> Thanks. Do you have any comments on the question about random chunking of rows and whether that offers any benefit? Or does partitioning a dataset only offer benefits if the partitioning is based on a sensible column?

Random chunking will not change the amount of memory used.

> But this is seemingly just as slow as loading the full table. I had thought only ... columns would be read into memory so there would be a time savings.

That should be the case. I believe this was broken in r-arrow 9, so if you are using version 9 you might try version 8 or 10.

We always read CSV data in blocks (usually 4MB) and we should drop the data you don't need after reading it in. So even for CSV I would expect to see memory savings when using a select. I'm afraid I don't know R well enough to say why that didn't work.

That being said, parquet is always a good idea too :)
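For reference, here is a minimal R sketch of the select-then-collect pattern being discussed. The file path and column names are hypothetical, not from the original issue:

```r
library(arrow)
library(dplyr)

# Open a CSV-backed dataset lazily (no full read yet).
ds <- open_dataset("data/big.csv", format = "csv")

# Only the selected columns should be materialized; the rest of each
# block is dropped after parsing, so peak memory should stay low.
result <- ds %>%
  select(id, value) %>%
  collect()
```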
