wgtmac commented on PR #197: URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1700283600
> As far as performance goes, writing the indexes took 100s of microseconds vs total write times in the seconds 😄 Actually generating the histograms was a larger impact than writing them.

Do you have the time spent on collecting the histograms? And what about the average number of records per page and the total number of records in the file? I ask because the number of pages can significantly affect the page index size.

@etseidl From the above result, I am not so worried about the increase in the column index size. IMHO, although the initial design goal of the page index was mainly page filtering, the OffsetIndex can be used on its own for better I/O planning of pages instead of blindly reading them in sequence. Therefore I do not object to adding `SizeStatistics` to the ColumnIndex. The downside is that people who do not need this info still have to pay for the I/O and Thrift deserialization of the SizeStatistics.
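To illustrate the I/O planning point, here is a minimal sketch of how a reader might turn OffsetIndex page locations into coalesced byte ranges. It is not taken from any Parquet implementation: the `PageLocation` dataclass only mirrors the three fields of the Thrift struct, and `plan_reads` and its `max_gap` parameter are hypothetical names introduced for this example. It assumes the pages are listed in file order, as they are within a column chunk.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PageLocation:
    # Mirrors the fields of the Thrift PageLocation struct.
    offset: int                # file offset of the page
    compressed_page_size: int  # size of the page in bytes, header included
    first_row_index: int       # first row of the page within the row group


def plan_reads(pages: List[PageLocation], wanted: List[int],
               max_gap: int = 0) -> List[Tuple[int, int]]:
    """Return coalesced (offset, length) byte ranges covering the wanted pages.

    `wanted` holds page indices; `max_gap` (hypothetical knob) is the largest
    number of unwanted bytes we accept reading in order to merge two ranges.
    """
    ranges: List[List[int]] = []  # each entry is [start, end)
    for i in sorted(set(wanted)):
        start = pages[i].offset
        end = start + pages[i].compressed_page_size
        if ranges and start - ranges[-1][1] <= max_gap:
            # Close enough to the previous range: extend it.
            ranges[-1][1] = max(ranges[-1][1], end)
        else:
            ranges.append([start, end])
    return [(s, e - s) for s, e in ranges]
```

With four pages at offsets 100, 150, 220 and 250, asking for pages 0, 1 and 3 yields two reads, because pages 0 and 1 are adjacent and page 2 is skipped. Raising `max_gap` to the size of page 2 trades a few wasted bytes for a single read.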
