emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1151338163
##########
src/main/thrift/parquet.thrift:
##########
@@ -190,6 +190,41 @@ enum FieldRepetitionType {
/** The field is repeated and can contain 0 or more values */
REPEATED = 2;
}
+/**
+ * A structure for capturing metadata for estimating the unencoded,
uncompressed size
+ * of data.
+ *
+ * Writers should populate all fields in this struct except for the exceptions
listed per field.
+ */
+struct SizeEstimationStatistics {
+ /**
+ * The number of logical physical bytes stored for BYTE_ARRAY data values.
Logical bytes refers to the number
+ * of bytes needed if no special encoding is used. This is exclusive of the
bytes needed
+ * to store the length of each byte array. In other words, this field is
equivalent to the (size of
+ * PLAIN-ENCODING the byte array values) - (4 bytes * number of values
written). To determine logical sizes
+ * of other types, readers can use schema information multiplied by
the number of non-null values.
+ * The number of non-null values can be inferred from the histograms below.
+ *
+ * For example, if a column chunk is dictionary encoded with a dictionary
["a", "bc", "cde"] and a data page
+ * has indexes [0, 0, 1, 2], this value is expected to be 7 (1 + 1 + 2 +
3).
+ *
+ * This option should only be set for physical and logical types that would
use BYTE_ARRAY when encoded with PLAIN encoding.
+ */
+ 1: optional i64 logical_variable_width_stored_bytes;
+ /**
+ * When present there is expected to be one element corresponding to each
repetition (i.e. size=max repetition_level+1)
+ * where each element represents the number of times the repetition level
was observed in the data.
+ *
+ * This value is optional if max_repetition_level is 0.
+ */
+ 2: optional list<i64> repetition_level_histogram;
+ /**
+ * Same as repetition_level_histogram except for definition levels.
+ *
+ * This value is optional when max_definition_level is 0.
+ */
+ 3: optional list<i64> definition_level_histogram;
Review Comment:
It might pay to illustrate exact queries, but if this is just answering the
question "is there any null element at a particular nesting level?", I think
the definition level histogram by itself gives that information.
Take nested lists where both lists and elements can be nullable at each
level. IIRC, the definition levels would be represented as follows:
0 - null top level list
1 - empty top level list
2 - null nested list
3 - empty nested list
4 - null leaf element
5 - present leaf element
So if the query is for top level list `is null`, one could prune when
`def_level[0] == 0` (no values were recorded at definition level 0). For
`is not null`, one could prune if `def_level[0] == num_values` for the page
(i.e. all values are null).
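A minimal sketch of that pruning rule, reading `def_level[i]` as the count stored at index `i` of `definition_level_histogram` and assuming the six-level layout listed above. The function names and the sample histogram are hypothetical, not part of any Parquet API:

```python
def can_prune_is_null(def_level_histogram):
    """Top level list `is null`: skip the page if no value has definition level 0."""
    return def_level_histogram[0] == 0


def can_prune_is_not_null(def_level_histogram, num_values):
    """Top level list `is not null`: skip the page if every value has definition level 0."""
    return def_level_histogram[0] == num_values


# Hypothetical page: 2 null top level lists, 1 empty top level list,
# 3 null leaf elements and 10 present leaf elements.
histogram = [2, 1, 0, 0, 3, 10]
num_values = sum(histogram)

print(can_prune_is_null(histogram))                  # page contains null lists
print(can_prune_is_not_null(histogram, num_values))  # page contains non-null lists
```

Neither predicate needs repetition levels, which is why the definition level histogram alone seems sufficient for this class of query.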
I believe similar logic holds for `def_level[2]`, but it could get more
complicated depending on the semantics: whether a top level null element
should imply that the nested list is also null, or whether an empty list
implies the nested list should be considered null (either way it should still
be derivable by using histogram indices 0, 1 and 2).
One thing the joint histogram (pairs of rep/def level counts) could give you
is the number of first list elements that are null, but I'm not sure how useful
that is. I would need to think about other queries the joint histogram would
enable (or, if you have more examples of supported queries, we can figure out
if one is needed).
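
To make the joint histogram idea concrete, here is a rough sketch. It assumes the nested list layout above (max repetition level 2, definition level 4 for a null leaf) and interprets "first list element" as the first element of an innermost list, i.e. a value whose repetition level is below the max. All names and the sample levels are hypothetical:

```python
from collections import Counter


def joint_histogram(rep_levels, def_levels):
    """Count how often each (repetition level, definition level) pair occurs."""
    return Counter(zip(rep_levels, def_levels))


def null_first_elements(joint, max_rep_level=2, null_leaf_def_level=4):
    """Count null leaves that start an innermost list (rep level < max)."""
    return sum(
        count
        for (rep, dl), count in joint.items()
        if rep < max_rep_level and dl == null_leaf_def_level
    )


# Hypothetical levels for three records: [[None, 1], [2]], [[None]], [[3, None]]
rep_levels = [0, 2, 1, 0, 0, 2]
def_levels = [4, 5, 5, 4, 5, 4]

joint = joint_histogram(rep_levels, def_levels)
print(null_first_elements(joint))  # null leaves in first position
print(Counter(def_levels)[4])      # all null leaves, from definition levels only
```

The two separate histograms only give the total number of null leaves; distinguishing the ones in first position needs the paired counts.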