[I] [C++][Parquet] splitting and saving big datasets consumes all available RAM and fails [arrow]

via GitHub Mon, 13 Nov 2023 14:06:24 -0800


vkhodygo opened a new issue, #38691:
URL: https://github.com/apache/arrow/issues/38691


   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   I have a dataset stored as `.csv` files, 1000 files, 1.000.000 records each. 
Id like to convert it to `.parquet` and split into logical partitions to save 
some space and ease access to it. However, when I try to do this my code 
consumes *all* available memory and fails spectacularly. 
   
   The code to get mock data is provided below. The actual values are a bit 
different in terms of distributions, but not to the point of orders of 
magnitude. 
   
   ```python
   import numpy as np
   import pyarrow as pa
   
   SAMPLE_SIZE = 1_000_000_000
   _rng = np.random.default_rng(123)
   
   _schema = pa.schema([pa.field(f'f{i}', pa.float32()) for i in range(10)] + 
                       [pa.field(f'i{i}', pa.uint8()) for i in range(37)] +
                       [pa.field(f'b{i}', pa.bool_()) for i in range(40)])
   
   data = pa.table({f'f{i}': np.random.randn(SAMPLE_SIZE) for i in range(10)} |
                   {f'i{i}': np.random.randint(1, 13, SAMPLE_SIZE, 
dtype=np.uint8) for i in range(1)} |
                   {f'i{i}': np.random.randint(1, 6, SAMPLE_SIZE, 
dtype=np.uint8) for i in range(1, 37)} |
                   {f'b{i}': np.random.randint(0, 2, SAMPLE_SIZE, 
dtype=np.uint8) for i in range(40)},  schema=_schema)
   ```
   
   I'd like to split this data the following way:
   
   ```python
   import pyarrow.dataset as ds
   file_options = ds.ParquetFileFormat().make_write_options(compression='gzip')
   
   ds.write_dataset(data,
                    base_dir="./syntet",
                    format="parquet",
                    file_options=file_options,
                    partitioning=ds.partitioning(
                        pa.schema([
                            ("b1", pa.bool_()),
                            ("i1", pa.uint8()),
                            ("i2", pa.uint8()),
                            ("i3", pa.uint8())
                        ]),
                        flavor="hive"
                    ),
                    existing_data_behavior="delete_matching"
                    )
   ``` 
   
   That's 600 partitions, give or take. Considering 1.000.000.000 records 
that's `~1.670.000` rows per partition. Strictly speaking, I'd like to decrease 
this number even further by splitting it into 50-100 subsets since this is only 
a sample of data. 
   
   I tried both Python and R to do this, the outcome stays the same. I also 
tried multiple machines with the following `RAM/swap` specs: `64/200`. 
`256/256`. `1024/8`, no result so far. 
   
   The only "working" solution is to read the whole dataset, filter a subset 
manually and save it. Wash, rinse, repeat; I think I had to read about 4TB of 
data in total during that, it also took somewhere around 10 hours to finish. 
This is clearly not sustainable.
    
   I also tried searching across other performance related issues, but most if 
not all of them are about *reading* data.
   No, I'm afraid I can't reduce the number of partitions, the only direction 
here is to go even further. When you do cross-table matching or targeted 
selection even these chunks are too big.
   
   ### Component(s)
   
   C++, Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] [C++][Parquet] splitting and saving big datasets consumes all available RAM and fails [arrow]

Reply via email to