etseidl commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318770227


##########
src/main/thrift/parquet.thrift:
##########
@@ -977,6 +1073,15 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+  /**
+    * Repetition and definition level histograms for the pages.
+    *
+    * This contains some redundancy with null_counts, however, to accommodate
+    * the widest range of readers both should be populated when either the max
+    * definition and repetition level meet the requirements specified in
+    * RepetitionDefinitionLevelHistogram.
+   **/
+  6: optional list<RepetitionDefinitionLevelHistogram> repetition_definition_level_histograms

Review Comment:
   > It just so happens that my application does this :). So yes, the more 
memory we waste, the fewer indexes we can keep in memory, the fewer cache hits 
we have, which would be a shame if it was preventable.
   > 
   > Yes, we could store the thrift-encoded data. But as laid out above, 
thrift-decoding also becomes more costly with a list-of-lists design. A 
ColumnIndex is, e.g., used for a point access (equality predicate on a sorted, 
partitioned table) and in this case decoding speed matters, as users expect a 
very quick answer for such a query that returns a single row.
   
   @JFinis point taken. I've been down the same rabbit holes as you :) Point 
access is important to me as well, but when I'm wearing a different hat :wink: 
Anyway, like I said, I'm not against your proposal; it would honestly make my 
life easier. I just thought momentum was against you. 
   
   You also make a good case for the column major ordering. It's either cache 
misses for me on the encoding side or cache misses for you on the read side. In 
a write-once-read-many system, I think your cache misses are the more important 
ones.
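   
To make the trade-off concrete, here is a minimal sketch of the two orderings. It assumes a hypothetical flat list of counts rather than the `list<RepetitionDefinitionLevelHistogram>` in this diff, and it assumes "column major" means grouping counts by level across pages. The function names are illustrative only and do not come from any Parquet implementation.

```python
# Hypothetical layouts: neither is taken from the Thrift definition in this PR.

def flatten_page_major(page_histograms):
    """Concatenate per-page histograms: [page0 levels..., page1 levels..., ...]."""
    return [count for hist in page_histograms for count in hist]


def flatten_level_major(page_histograms):
    """Group by level: [level0 for all pages..., level1 for all pages..., ...].

    The writer produces one histogram per page, so building this layout
    means scattering each page's counts across the output list.
    """
    num_levels = len(page_histograms[0])
    return [hist[level] for level in range(num_levels) for hist in page_histograms]


def counts_for_level(level_major, num_pages, level):
    """Read one level's counts for every page as a single contiguous slice."""
    return level_major[level * num_pages:(level + 1) * num_pages]


# Three pages, max definition level 2, so three counts per page.
pages = [[1, 0, 9], [0, 2, 8], [4, 0, 6]]
page_major = flatten_page_major(pages)
level_major = flatten_level_major(pages)

# Level-major: the level-0 counts are contiguous.
level0_contiguous = counts_for_level(level_major, num_pages=3, level=0)
# Page-major: the same counts need a strided read.
level0_strided = page_major[0::3]
```

With the level-major layout the reader scans a contiguous run when it filters pages on a single level, while the writer pays for the scattered writes once at encoding time.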


