wgtmac commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1700283600

   > As far a performance goes, writing the indexes took 100s of microseconds 
vs total write times in the seconds 😄 Actually generating the histograms was a 
larger impact than writing them.
   
   Do you have numbers for the time spent collecting the histograms? And what 
about the average number of records per page and the total number of records in 
the file? I ask because the number of pages can significantly affect the page 
index size. @etseidl 
   
   From the above result, I am not so worried about the increase in the column 
index size. IMHO, though the initial design goal of the page index is mainly 
page filtering, OffsetIndex can be used on its own for better I/O planning of 
pages instead of blindly reading them in sequence. Therefore I do not object to 
adding `SizeStatistics` to the ColumnIndex. The downside is that readers who do 
not need this info still have to pay for the I/O and Thrift deserialization of 
the SizeStatistics.
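
To illustrate the I/O planning point, here is a minimal sketch, not taken from any Parquet implementation. It assumes the reader has already decoded the OffsetIndex into a list of page locations in file order; the `PageLocation` fields mirror the Thrift struct, while `plan_reads` and `max_gap` are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class PageLocation:
    # Field names follow the PageLocation struct in parquet.thrift.
    offset: int                # byte offset of the page in the file
    compressed_page_size: int  # page size in bytes, including its header
    first_row_index: int       # index of the page's first row in the row group


def plan_reads(
    pages: List[PageLocation],
    wanted: Iterable[int],
    max_gap: int = 0,
) -> List[Tuple[int, int]]:
    """Return (offset, length) byte ranges covering the wanted pages.

    Hypothetical helper: pages whose byte ranges are at most `max_gap`
    bytes apart are coalesced into a single read. Assumes `pages` is
    ordered by file offset, as in an OffsetIndex.
    """
    ranges: List[List[int]] = []
    for i in sorted(set(wanted)):
        start = pages[i].offset
        end = start + pages[i].compressed_page_size
        if ranges and start - ranges[-1][1] <= max_gap:
            # Close enough to the previous range: extend it.
            ranges[-1][1] = max(ranges[-1][1], end)
        else:
            ranges.append([start, end])
    return [(start, end - start) for start, end in ranges]
```

With `max_gap=0` only adjacent pages are merged; a larger value trades some wasted bytes for fewer read requests.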
   
   

