westonpace commented on issue #13949:
URL: https://github.com/apache/arrow/issues/13949#issuecomment-1224980434

   I'm not an expert on the Java side of things, but I am pretty familiar with 
how the dataset scanner (which is in C++) works.  The dataset scanner is going 
to try to read multiple files at the same time.  In fact, with parquet, it 
will actually try to read multiple batches within a file concurrently.
   
   In addition, the dataset scanner is going to read ahead by a certain amount.  
For example, even if you only ask for one batch it will read more than one 
batch.  It tries to accumulate enough of a "buffer" that an I/O slowdown won't 
cause a hitch in processing (this is very similar, for example, to the kind of 
buffering that happens when you watch a YouTube video).
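   
   To make that concrete, here's a rough pyarrow sketch (the `data/` directory 
is just a hypothetical stand-in for your files); even consuming a single batch 
from the scanner will cause it to queue up more batches in the background:
   
   ```python
   import pyarrow.dataset as ds

   # Hypothetical directory of parquet files
   dataset = ds.dataset("data/", format="parquet")

   # The scanner behind to_batches() reads files (and batches within a
   # parquet file) concurrently and keeps a readahead buffer filled.
   batches = dataset.to_batches()

   # Pulling just one batch still triggers readahead of later batches,
   # so resident memory can be well above the size of this single batch.
   first = next(iter(batches))
   print(first.num_rows)
   ```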
   
   > I expect Arrow to use the amount of memory that corresponds to a single 
batch multiplied by the amount of files, but in reality the memory used is much 
more than the entire files.
   
   This is not quite accurate because of the above readahead.
   
   > The files were created with pandas default config (using pyarrow), and 
reading them in java gives the correct values.
   
   How many record batches are in each file?  Do you know roughly how large 
each record batch is?
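   
   If it's easier, you can answer both of those questions by inspecting the 
parquet metadata directly.  Since the datasets parquet reader works in units of 
row groups (more on that below), the row group counts and sizes are the numbers 
that matter (the path here is hypothetical):
   
   ```python
   import pyarrow.parquet as pq

   # Hypothetical file path
   md = pq.ParquetFile("part-0.parquet").metadata

   print("row groups:", md.num_row_groups)
   for i in range(md.num_row_groups):
       rg = md.row_group(i)
       print(f"row group {i}: {rg.num_rows} rows, "
             f"{rg.total_byte_size} bytes uncompressed")
   ```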
   
   > Reading 20 uncompressed parquet files with total size 3.2GB, takes more 
than 12GB in RAM, when reading them "concurrently".
   
   Is this 3.2GB per parquet file?  Or 3.2GB across all parquet files?
   
   In 9.0.0 I think the default readahead configuration will read ahead up to 4 
files and aim for about 2Mi rows per file.  However, the Arrow datasets 
parquet reader will only read entire row groups.  So, for example, if your file 
is one row group with 20Mi rows then it will be forced to read all 20Mi rows.  
I believe the pandas/pyarrow default will create 64Mi rows per row group.
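   
   If your files were written with that default (one huge row group each), you 
can cap the row group size at write time; a minimal sketch with pyarrow (the 
1Mi-row value is just something to experiment with, not a recommendation):
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   table = pa.table({"x": range(10_000_000)})  # stand-in for your real data

   # Cap each row group at ~1Mi rows so the scanner can read (and release)
   # the file in smaller pieces instead of one giant row group.
   pq.write_table(table, "part-0.parquet", row_group_size=1_000_000)
   ```
   
   The same keyword should also work through pandas, e.g. 
`df.to_parquet("part-0.parquet", engine="pyarrow", row_group_size=1_000_000)`, 
since extra keyword arguments are forwarded to pyarrow.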
   
   So if each file is 3.2GB and each file is a single row group, then I would 
expect to see about 4 files' worth of data in memory as part of the readahead 
(4 × 3.2GB ≈ 12.8GB), which is pretty close to the 12GB of RAM you are seeing.
   
   You can tune the readahead (at least in C++) and the row group size (during 
the write) to try to find something workable with 9.0.0.  Ultimately though I 
think we will want to someday support partial row group reads from parquet (we 
should be able to aim for page-level resolution).  This is tracked by 
https://issues.apache.org/jira/browse/ARROW-15759 but I'm not aware of anyone 
working on it at the moment, so for the time being I think you are stuck with 
controlling the size of the row groups that you write.
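   
   For the readahead side, the knobs are the fragment and batch readahead on 
the C++ `ScanOptions`.  Newer pyarrow releases expose them on the scanner as 
well, so the sketch below is only an assumption of what that looks like (these 
parameter names may not be available in 9.0.0's Python bindings):
   
   ```python
   import pyarrow.dataset as ds

   dataset = ds.dataset("data/", format="parquet")  # hypothetical path

   # Assumed sketch: smaller readahead values trade throughput for a
   # smaller memory footprint; these numbers are only illustrative.
   scanner = ds.Scanner.from_dataset(
       dataset,
       fragment_readahead=1,  # how many files to read ahead
       batch_readahead=4,     # how many batches to read ahead within a file
   )

   for batch in scanner.to_batches():
       ...  # process each batch, then let it go out of scope
   ```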

