Hugo-loio commented on issue #38552:
URL: https://github.com/apache/arrow/issues/38552#issuecomment-1793492818

   I tried setting the `use_dictionary=False` option when writing; it may have reduced the file size a bit, but the rest of the problems remain. I can't use the `DELTA_BINARY_PACKED` encoding because my columns contain floats (it only applies to integer columns).
   Honestly, at this point I think I will work around the issue by switching to CSV. It seems strange, but the same dataset is considerably smaller when saved as a CSV file (200 MB vs. the 700 MB of the parquet file), and I don't get the memory problems when reading. Here is the same script, but with parquet replaced by CSV.
   ```python
   import numpy as np
   import pandas as pd
   import os
   
   array_length = 2**20
   num_arrays = 10
   filename = "test.csv"
   
   # Only run this if the data file hasn't been already generated
   if not os.path.isfile(filename):
   
       # Generate arrays and write them to a dataframe
       print("Generating the dataframe...")
       columns = np.arange(1, array_length + 1)
       data = pd.DataFrame(columns = columns)
       for i in range(1,num_arrays+1):
           array = np.random.rand(array_length)
           data.loc[i] = array
       print("Done")
   
       # Save the dataframe to disk as CSV
       print("Saving to disk...")
       data.to_csv(filename)
       print("Done")
   
   # Reading the data
   print("Reading the data...")
   data = pd.read_csv(filename)
   print("Done")
   ```
   
   I wanted to use parquet because I had read that file sizes were smaller and read/write speeds were faster compared to CSV, so I wouldn't expect this type of behaviour. At least I hope I was able to reproduce the problem in a simple way so it can be fixed in the future.

