etseidl commented on PR #197: URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1700294131
> Do you have the time spent on collecting the histograms? And what about the average number of records per page and total number of records in the file? The reason I ask for this is that number of pages can significantly affect the page index size.

@wgtmac I'll have to get back to you on that (the data is on my work computer 😅). The number of rows per page should be around 20000 (but can be a little lower due to `max_page_size` constraints), but the records per page can vary wildly in the nested files.

I'll get some exact times tomorrow, but IIRC for the "flat 1" file, the histogram collection was under 30ms once I figured out how to do that part in parallel (it had been over 60ms with a serial implementation).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
