ssirovica commented on issue #14007:
URL: https://github.com/apache/arrow/issues/14007#issuecomment-1232469822

   Awesome, it is a relief to have the problem confirmed!
   
   Looking at the code, I think it makes sense to implement the copy in `ToThrift`, since that's a single place and the problem will be present for all types: min and max are both `[]byte`, i.e. reference values.
   
   Code-generating the copies in every Arrow type's `Encode()` would be feasible as well, though.
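   As a minimal sketch of what I mean (the helper name `copyStat` is mine, not from the arrow codebase): before handing the min/max values over to the thrift statistics struct, copy them into fresh allocations so the metadata no longer aliases the record's buffer.

   ```go
   package main

   import (
   	"bytes"
   	"fmt"
   )

   // copyStat is a hypothetical helper: it copies a min/max value out of the
   // record's buffer so the thrift statistics own their own bytes.
   func copyStat(v []byte) []byte {
   	if v == nil {
   		return nil
   	}
   	out := make([]byte, len(v))
   	copy(out, v)
   	return out
   }

   func main() {
   	record := []byte{1, 2, 3, 4, 5, 6, 7, 8}
   	min := record[0:4] // still aliases record's backing array

   	owned := copyStat(min)
   	record[0] = 99 // mutating the record no longer affects the copy

   	fmt.Println(bytes.Equal(owned, []byte{1, 2, 3, 4})) // true
   }
   ```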
   
   > The answer is that row group statistics are also stored in the file level 
metadata. This means that the file writer itself ends up holding onto the stats 
for the row group, and since it writes an entire record as one row group, and 
the record is just a single contiguous byte array, the file level statistics 
for each row group ends up forcing it to keep around the memory for every 
record. 
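   To make the retention concrete, here's a small standalone illustration (sizes and names are mine): a min/max taken as a sub-slice of a large record buffer keeps the whole backing array reachable for as long as the statistic is held.

   ```go
   package main

   import "fmt"

   func main() {
   	// Pretend this is one row group's contiguous record data.
   	record := make([]byte, 1<<20)
   	record[0] = 1

   	// A "min" statistic taken as a sub-slice is just a view into record.
   	min := record[0:4]

   	// Dropping our reference to record does not free the 1 MiB backing
   	// array: min still pins it for the garbage collector.
   	record = nil

   	fmt.Println(len(min), cap(min)) // 4 1048576
   }
   ```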
   
   Ah! So if I understand correctly, we retain the min and max for every row group until the end of the file. However, we should be able to drop these statistics once a row group is written, as we don't need to keep them around for parquet.
   
   Would it be worth exploring a way to avoid keeping the row-group statistics long term in the file-level metadata? That would save us both the copy and the memory. With that said, it's a more complex change, and it would only be a problem when there are many row groups, many columns, and large min/max values.
   
   I'm also happy to try and take a stab at this in a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
