westonpace commented on issue #13949: URL: https://github.com/apache/arrow/issues/13949#issuecomment-1224980434
I'm not an expert on the Java side of things, but I am pretty familiar with how the dataset scanner (which is in C++) works. The dataset scanner will try to read multiple files at the same time. In fact, with Parquet, it will concurrently read multiple batches within a single file. In addition, the dataset scanner reads ahead by a certain amount: even if you only ask for one batch, it will read more than one. It tries to accumulate enough of a "buffer" that an I/O slowdown won't cause a hitch in processing (very similar, for example, to the buffering that happens when you watch a YouTube video).

> I expect Arrow to use the amount of memory that corresponds to a single batch multiplied by the amount of files, but in reality the memory used is much more then the entire files.

This is not quite accurate because of the readahead described above.

> The files were created with pandas default config (using pyarrow), and reading them in java gives the correct values.

How many record batches are in each file? Do you know roughly how large each record batch is?

> Reading 20 uncompressed parquet files with total size 3.2GB, takes more then 12GB in RAM, when reading them "concurrently".

Is this 3.2GB per parquet file, or 3.2GB across all parquet files?

In 9.0.0 I think the default readahead configuration will read ahead up to 4 files and aims for about 2Mi rows per file. However, the Arrow datasets parquet reader will only read entire row groups. So, for example, if your file is one row group with 20Mi rows, then it will be forced to read all 20Mi rows. I believe the pandas/pyarrow default will create 64Mi rows per row group. So if each file is 3.2GB and each file is a single row group, then I would expect to see about 4 files' worth of data in memory as part of the readahead, which is pretty close to 12GB of RAM.

You can tune the readahead (at least in C++) and the row group size (during the write) to try and find something workable with 9.0.0; rough sketches of both follow below. Ultimately, though, I think we will want to someday support partial row group reads from parquet (we should be able to aim for page-level resolution). This is tracked by https://issues.apache.org/jira/browse/ARROW-15759, but I'm not aware of anyone working on it at the moment, so for the current time I think you are stuck with controlling the size of the row groups that you are writing.
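As a rough sketch of the readahead tuning, here is what it looks like through pyarrow's dataset API (which wraps the same C++ scanner). The path is hypothetical, and the `batch_readahead` / `fragment_readahead` keyword arguments are only exposed in newer pyarrow releases; in 9.0.0 these knobs may only be reachable from C++.

```python
import pyarrow.dataset as ds

# Hypothetical directory of parquet files.
dataset = ds.dataset("/path/to/parquet/files", format="parquet")

scanner = dataset.scanner(
    batch_size=64 * 1024,   # rows per emitted batch (whole row groups are still read)
    fragment_readahead=2,   # how many files are read concurrently
    batch_readahead=4,      # how many batches are buffered ahead
)

for batch in scanner.to_batches():
    ...  # process one RecordBatch at a time
```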

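On the write side, the knob that matters here is the row group size. A minimal sketch with pyarrow (the table contents are hypothetical); `row_group_size` caps the number of rows per row group, which in turn bounds how much data the dataset scanner must materialize per read:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical table; replace with your real data.
table = pa.table({"x": list(range(1_000_000))})

# Smaller row groups mean the scanner holds less data in memory at once,
# at the cost of slightly more metadata and per-group overhead.
pq.write_table(table, "data.parquet", row_group_size=100_000)
```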