emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148822969
##########
src/main/thrift/parquet.thrift:
##########
@@ -190,6 +190,35 @@ enum FieldRepetitionType {
/** The field is repeated and can contain 0 or more values */
REPEATED = 2;
}
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data.
+ */
+struct SizeEstimationStatistics {
+ /**
+ * The number of logic bytes needed to store present/non-null values.
+ * Unless specified below, the computed size is the size it would take to plain-encode the underlying
+ * physical type.
+ * Special calculations:
+ * - Enum: plain-encoded BYTE_ARRAY size
+ * - Integers (same size used for signed and unsigned): int8 - 1 bytes, int16 - 2
+ * - Decimal - Each value is assumed to take the minimal number of bytes necessary to encode
+ * the precision of the decimal value.
+ * - Nested types (lists, nested groups and maps) - No additional size for these structures
+ * are accounted for in this field, instead the histogram fields below can be
+ * be used to estimate overhead to recreate these structures.
+ */
+ 1: optional i64 logical_value_byte_storage;
+ /**
+ * When present there is expected to be one element corresponding to each repetition (i.e. size=max repetition_level+1)
+ * where each element represens the number of time the repetition level was observed in the data.
+ */
+ 2: optional list<i64> repetition_level_histogram;
Review Comment:
There are a few things to consider here:
1. What happens if the max rep/def level is zero? Should we still require
these histograms in that case? This also relates to whether the size should
be max_def_level + 1 or max_def_level. The first allows readers to
sanity-check that the statistics sum to num_values; the second does not.
2. Should we require variable size bytes if the column doesn't have any (0
is an acceptable value here)?
3. It has been drilled into me that in any message that lives long enough,
one will live to regret having a required field. I'd prefer to document that
writers should populate the relevant fields (and be specific about when we
believe they are relevant).
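
To make the sanity check in point 1 concrete, here is a minimal sketch,
assuming the histogram is sized max_level + 1 and that the field stays
optional. The function name and parameters are hypothetical and not part of
any Parquet library.

```python
from typing import Optional, Sequence


def histogram_is_consistent(
    histogram: Optional[Sequence[int]],
    max_level: int,
    num_values: int,
) -> bool:
    """Check a level histogram against num_values (hypothetical helper).

    Assumes the max_level + 1 sizing: one bucket for each level 0..max_level.
    """
    # The field is optional, so a missing histogram is not an error.
    if histogram is None:
        return True
    # One bucket per level, including level 0.
    if len(histogram) != max_level + 1:
        return False
    # Counts can never be negative.
    if any(count < 0 for count in histogram):
        return False
    # Every value has exactly one level, so the buckets must sum to num_values.
    return sum(histogram) == num_values
```

Under this sizing, a column with max_level = 0 would carry a single bucket
equal to num_values, which adds no information. That is one argument for
letting writers omit the histogram in that case.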