EB80 edited a comment on issue #11665:
URL: https://github.com/apache/arrow/issues/11665#issuecomment-975690421


   I expect that the earlier issue with arrow::read_feather was just because the file had been written with the very old feather::write_feather.
   
   I used the following code to test arrow::write_feather:
   
   ```R
   rm(list = ls())
   
   # set the wd to be where this script is saved
   setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
   
   # set dimensions based on the real file
   numRows = 26e6 # 26M rows in the real file
   numCols = 150 # 150 columns in the real file
   
   # whip up a fake dataframe
   fakeDataframe <- as.data.frame(matrix("fake string", numRows, numCols))
   
   # change the column names for aesthetic purposes, I guess
   names(fakeDataframe) <- sprintf("Fake Column %s", 1:numCols)
   
   # save the fake file with data.table
   data.table::fwrite(fakeDataframe, "fakeFile.csv")
   
   # save the fake file with arrow
   arrow::write_feather(fakeDataframe, "fakeFile.feather")
   ```
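   
   One way to narrow down where the hang occurs (a sketch, not from the original comment; it assumes the arrow package, whose `Table$create()` converts an R data frame to an Arrow Table) is to split arrow::write_feather into its two underlying steps. If the conversion step alone hangs, the bottleneck is the R-to-Arrow conversion rather than the file write. A small stand-in frame is used here for brevity; the real test would pass fakeDataframe.
   
   ```R
   # Sketch: separate the R-to-Arrow conversion from the file write.
   # Small stand-in frame; the real test would use fakeDataframe.
   smallDf <- as.data.frame(matrix("fake string", 1000, 5))
   
   # Step 1: convert to an Arrow Table. A conversion bottleneck
   # would show up here, before any disk I/O happens.
   tbl <- arrow::Table$create(smallDf)
   
   # Step 2: write the already-converted Table to disk.
   arrow::write_feather(tbl, tempfile(fileext = ".feather"))
   ```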
   
   The fwrite step took about 10 minutes.  While the dimensions of the 
fake file match those of the real file, its size on disk is much larger (46 
GB versus 32 GB).  I wrote a while loop to trim rows off the fake file until 
its size matched the real file's, but object.size() was painfully slow.  Either 
way, I figured this would be a suitable test: arrow::write_feather hangs with 
this fake dataframe just as before.
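
   Since object.size() walks the whole in-memory object, a much cheaper check for the size comparison above is file.size(), which just stats the file on disk and returns bytes.  A sketch (sizeGB is a hypothetical helper, not part of the original code; the demo uses a temporary file where the real comparison would use "fakeFile.csv" and the real file's path):
   
   ```R
   # Hypothetical helper: on-disk size of a file in gigabytes.
   # file.size() returns the byte count from the filesystem,
   # without touching any in-memory object.
   sizeGB <- function(path) file.size(path) / 1024^3
   
   # Small demonstration file standing in for fakeFile.csv:
   tmp <- tempfile()
   writeLines(strrep("x", 1000), tmp)
   sizeGB(tmp)
   ```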


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

