kevinjqliu commented on issue #3168:
URL: 
https://github.com/apache/iceberg-python/issues/3168#issuecomment-4100157344

   Thanks for reporting this! 
   
   > The to_arrow_batch_reader does not help here either, because -as per my 
understanding- in the batch reader of pyiceberg each batch represents an 
individual datafile. Hence, if there is one problematic 6MB datafile, it makes 
no difference if you use the batch reader or not. I also have the impression 
that when you iterate over the reader, pyarrow has already loaded the parquet 
file in a separate thread and this is where the memory explosion actually 
happens.
   
   This is a bug. I found out about it recently (see 
https://github.com/apache/iceberg-python/discussions/3122). 
`to_arrow_batch_reader` should read a single parquet file one batch at a 
time, thus reducing memory consumption. 
   
   https://github.com/apache/iceberg-python/pull/2676 is the proper fix for 
this behavior. I'd love to know whether it helps with your issue. 
   
   
   > There should be an option somewhere, e.g. in the data_scan to specify for 
which columns dictionary encoding should be used. This option should be 
forwarded to pyarrow internally somehow, so that pyarrow uses less memory.
   
   I think that's a reasonable feature request. I've opened 
https://github.com/apache/iceberg-python/issues/3170 to track it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

