zeroshade commented on issue #14007:
URL: https://github.com/apache/arrow/issues/14007#issuecomment-1233039005

   > If I understand, we will be retaining the min and max for every row group 
till the end of the file. However, we can drop these statistics once a row 
group is written as we don't need to keep them around for parquet.
   
   While you don't *need* to keep them around for Parquet, it's often good to do so when there are multiple Row Groups in a file. The file-level metadata contains information on each row group in the file, including where in the file those row groups start and end. By including the row-group-level statistics in the file-level metadata, a reader that is performing filtering, or that could otherwise benefit from the statistics, can skip entire row groups without having to separately read each row group's metadata. That's why this information is included in the file-level metadata. Since the file-level metadata isn't written until the very end of the file, it's necessary to keep this information around (albeit in Thrift-serialized form).
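   
   To make the skipping concrete, here's a minimal sketch using pyarrow (not the Go library this issue is about; the file name, the choice of column 0, and the `x == 42` predicate are all made up) of a reader pruning row groups using only the statistics in the footer:
   
   ```python
   import pyarrow.parquet as pq
   
   pf = pq.ParquetFile("data.parquet")   # hypothetical file; column 0 is an integer column
   md = pf.metadata                      # file-level metadata, parsed from the footer
   
   keep = []
   for i in range(md.num_row_groups):
       stats = md.row_group(i).column(0).statistics
       # Keep the row group only if its [min, max] range could contain x == 42
       # (or if it carries no usable statistics and so can't be ruled out).
       if stats is None or not stats.has_min_max or stats.min <= 42 <= stats.max:
           keep.append(i)
   
   table = pf.read_row_groups(keep)      # data pages of skipped row groups are never read
   ```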
   
   > Also, it would only be a problem if there are many RG's, columns and large 
min max values.
   
   Typically, most query engines actually suggest using only a single row group rather than many row groups in a single file, so I don't think it's particularly worthwhile to look into avoiding keeping the RG statistics long term. However, it would definitely make sense to update the ByteArray types to copy when encoding so as to avoid this memory issue. If you want to file the PR, go for it! Otherwise I'd be happy to create the PR to fix this.
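   
   As a rough, language-agnostic illustration of the copy-on-encode fix (sketched here in Python with made-up buffer sizes, since the actual change belongs in the Go ByteArray encoder): the problem is a statistic holding a reference into a large decoded buffer, which pins the whole buffer in memory; copying the few bytes out detaches it.
   
   ```python
   # Pretend this is one decoded data page (~8 MiB).
   page_buffer = bytearray(8 * 1024 * 1024)
   
   # Problem pattern: storing a view into the page as the "min" statistic.
   # The memoryview keeps the entire 8 MiB bytearray alive for as long as
   # the statistic is retained, even though the value itself is 16 bytes.
   min_as_reference = memoryview(page_buffer)[:16]
   
   # Fix pattern: copy the bytes out. The statistic now owns its own
   # 16-byte object, independent of the page buffer.
   min_as_copy = bytes(page_buffer[:16])
   
   del page_buffer
   # min_as_reference still pins the full 8 MiB via its underlying buffer;
   # min_as_copy does not, so the page can be reclaimed once writing moves on.
   ```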

