zeroshade commented on issue #14007: URL: https://github.com/apache/arrow/issues/14007#issuecomment-1233039005
> If I understand, we will be retaining the min and max for every row group till the end of the file. However, we can drop these statistics once a row group is written as we don't need to keep them around for parquet.

While you don't *need* to keep them around for Parquet, it's often good to do so if there are multiple Row Groups in a file. The file-level metadata contains information on each row group in the file, including where in the file those row groups start and end. By including the row-group-level statistics in the file-level metadata, a reader that is filtering, or that could otherwise benefit from the statistics, can skip entire row groups without having to separately read each row group's own metadata (a sketch of that footer-driven skipping is below). That's why this information is included in the file-level metadata. And since the file-level metadata isn't written until the very end of the file, it's necessary to keep this information around until then (albeit in thrift-serialized form).

> Also, it would only be a problem if there are many RG's, columns and large min max values.

Most query engines actually suggest using only a single row group per file rather than many, so I don't think it's particularly worthwhile to look into avoiding keeping the RG statistics long term. However, it would definitely make sense to update the ByteArray types to copy when encoding so as to avoid this memory issue (see the second sketch below). If you want to file the PR, go for it! Otherwise I'd be happy to create the PR to fix this.
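For illustration, here is a minimal, self-contained sketch of why the footer statistics are worth carrying to the end of the file. The types (`FileMeta`, `RowGroupMeta`) and the single `id` column are hypothetical stand-ins for the real thrift footer structures, not the actual Go parquet API:

```go
package main

import "fmt"

// RowGroupMeta is a hypothetical, simplified mirror of what the Parquet
// footer carries per row group: its byte range in the file plus
// per-column min/max statistics.
type RowGroupMeta struct {
	Offset, Length int64
	MinID, MaxID   int64 // stats for an example "id" column
}

type FileMeta struct {
	RowGroups []RowGroupMeta
}

// scanMatching shows why footer-level stats matter: a reader evaluating
// the predicate `id == want` can discard whole row groups using only the
// footer, without touching the row group data or its separate metadata.
func scanMatching(meta FileMeta, want int64) {
	for i, rg := range meta.RowGroups {
		if want < rg.MinID || want > rg.MaxID {
			fmt.Printf("row group %d: skipped via footer stats\n", i)
			continue // no I/O spent on this row group at all
		}
		fmt.Printf("row group %d: read bytes [%d, %d)\n",
			i, rg.Offset, rg.Offset+rg.Length)
	}
}

func main() {
	meta := FileMeta{RowGroups: []RowGroupMeta{
		{Offset: 4, Length: 1 << 20, MinID: 0, MaxID: 999},
		{Offset: 4 + 1<<20, Length: 1 << 20, MinID: 1000, MaxID: 1999},
	}}
	scanMatching(meta, 1500) // skips group 0, reads only group 1
}
```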

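And a minimal sketch of the copy-on-encode fix mentioned above. `ByteArray` and `Statistics` here are simplified stand-ins for the real types in the Go parquet module; the point is only that storing the incoming slice header directly would pin (and alias) the caller's buffer, while a defensive copy does not:

```go
package main

import "fmt"

// ByteArray models parquet's variable-length binary type: in the Go
// implementation it is a slice header pointing into the caller's buffer.
type ByteArray []byte

// Statistics tracks min/max for a ByteArray column.
type Statistics struct {
	min, max  ByteArray
	hasMinMax bool
}

// Update records a new value. Storing `v` directly would keep the caller's
// entire write buffer alive (and break if the buffer is reused), so the
// fix discussed above is to copy the bytes once a new min/max is found.
func (s *Statistics) Update(v ByteArray) {
	if !s.hasMinMax {
		s.min = append(ByteArray(nil), v...) // defensive copy
		s.max = append(ByteArray(nil), v...)
		s.hasMinMax = true
		return
	}
	if string(v) < string(s.min) {
		s.min = append(s.min[:0], v...) // reuse min's own storage
	}
	if string(v) > string(s.max) {
		s.max = append(s.max[:0], v...)
	}
}

func main() {
	buf := []byte("zebra") // pretend this is a large, reused write buffer
	var s Statistics
	s.Update(ByteArray(buf))
	copy(buf, "aaaaa") // caller reuses the buffer for the next batch
	// Still prints "zebra" for both: the copy decoupled the stats
	// from the caller's buffer.
	fmt.Printf("min=%q max=%q\n", s.min, s.max)
}
```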