Hugo-loio commented on issue #38552:
URL: https://github.com/apache/arrow/issues/38552#issuecomment-1793492818
I tried setting the `use_dictionary=False` option when writing; it may have reduced the file size a bit, but the rest of the problems remain. I can't use the `DELTA_BINARY_PACKED` encoding because my columns contain floats.
Honestly, at this point I think I will work around the issue by switching to CSV. It may seem strange, but the same dataset is considerably smaller when I save it to a CSV file (200 MB vs. the 700 MB of the Parquet file), and I don't get the memory problems when reading. Here is the same script, but with Parquet replaced by CSV.
```python
import numpy as np
import pandas as pd
import os

array_length = 2**20
num_arrays = 10
filename = "test.csv"

# Only run this if the data file hasn't already been generated
if not os.path.isfile(filename):
    # Generate arrays and write them to a dataframe
    print("Generating the dataframe...")
    columns = np.arange(1, array_length + 1)
    data = pd.DataFrame(columns=columns)
    for i in range(1, num_arrays + 1):
        array = np.random.rand(array_length)
        data.loc[i] = array
    print("Done")

    # Save the dataframe to disk as CSV
    print("Saving to disk...")
    data.to_csv(filename)
    print("Done")

# Read the data back
print("Reading the data...")
pd.read_csv(filename)
print("Done")
```
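For reference, this is roughly the Parquet-writing variant I was testing, with dictionary encoding turned off. It's only a minimal sketch using `pyarrow.parquet.write_table` on a much smaller dataframe than the script above, so the exact writer options in my real script may have differed slightly.
```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Small dataframe of random floats (far fewer columns than the real script,
# just so this sketch runs quickly).
df = pd.DataFrame(np.random.rand(10, 1024))
df.columns = [str(c) for c in df.columns]  # Parquet field names must be strings

# Convert to an Arrow table and write it with dictionary encoding disabled.
table = pa.Table.from_pandas(df)
pq.write_table(table, "test.parquet", use_dictionary=False)

# Read it back and check the shape.
df_back = pq.read_table("test.parquet").to_pandas()
print(df_back.shape)
```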
I wanted to use Parquet because I had read that the file sizes were smaller and the read/write speeds faster compared to CSV, so I wouldn't expect this type of behaviour. At least I hope I was able to reproduce the problem simply enough that it can be fixed in the future.