wgtmac commented on PR #35455: URL: https://github.com/apache/arrow/pull/35455#issuecomment-1578277465
> Did you try to benchmark writing the page index vs. writing statistics in data page headers? Perhaps in the future we can enable the page index by default?

I didn't benchmark the writer time, because writing should take longer when the page index is enabled:
- Skipping the stats only removes them from the Thrift-encoded page header; the stats comparison in the column writer, which is the heavy part, still runs.
- The page index builder does more work than just serializing stats. It also collects the sort order from page stats, collects page offsets, and so on.

The main goal of skipping stats in the page header is to reduce file size: they duplicate what the column index already stores, and they are easier to get from the column index.

We have enabled the page index by default internally, and the benefit it brings looks promising.
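To make the "more work than just serializing stats" point concrete, here is a minimal Python sketch of the extra bookkeeping a page index builder does on top of per-page min/max. It is a toy model and not Arrow's implementation: `PageStats`, `boundary_order`, and `build_page_index` are hypothetical names, values are plain integers, and the output dictionaries only loosely mirror the `ColumnIndex` and `OffsetIndex` structures of the Parquet format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PageStats:
    """Hypothetical per-page summary; not an Arrow or Parquet type."""
    min_value: int
    max_value: int
    null_count: int
    offset: int           # byte offset of the page within the file
    compressed_size: int  # page size in bytes
    num_rows: int


def boundary_order(pages: List[PageStats]) -> str:
    """Derive the ordering of pages from their min/max values."""
    pairs = list(zip(pages, pages[1:]))
    if all(a.min_value <= b.min_value and a.max_value <= b.max_value
           for a, b in pairs):
        return "ASCENDING"
    if all(a.min_value >= b.min_value and a.max_value >= b.max_value
           for a, b in pairs):
        return "DESCENDING"
    return "UNORDERED"


def build_page_index(pages: List[PageStats]) -> Tuple[Dict, Dict]:
    """Collect a column-index-like and an offset-index-like structure."""
    column_index = {
        "min_values": [p.min_value for p in pages],
        "max_values": [p.max_value for p in pages],
        "null_counts": [p.null_count for p in pages],
        "boundary_order": boundary_order(pages),
    }
    # The offset index records where each page lives and its first row.
    locations = []
    first_row = 0
    for p in pages:
        locations.append((p.offset, p.compressed_size, first_row))
        first_row += p.num_rows
    offset_index = {"page_locations": locations}
    return column_index, offset_index
```

In this model the per-page min/max still have to be computed before `build_page_index` is called, which is why dropping stats from the page header is not expected to make the writer faster.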
