emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1149682895
##########
src/main/thrift/parquet.thrift:
##########
@@ -190,6 +190,35 @@ enum FieldRepetitionType {
/** The field is repeated and can contain 0 or more values */
REPEATED = 2;
}
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data.
+ */
+struct SizeEstimationStatistics {
+ /**
+ * The number of logic bytes needed to store present/non-null values.
+ * Unless specified below, the computed size is the size it would take to plain-encode the underlying
+ * physical type.
+ * Special calculations:
+ * - Enum: plain-encoded BYTE_ARRAY size
+ * - Integers (same size used for signed and unsigned): int8 - 1 bytes, int16 - 2
+ * - Decimal - Each value is assumed to take the minimal number of bytes necessary to encode
Review Comment:
I originally had this. Given the two different opinions expressed, I'm going
to change this field to record only variable-width bytes, and say that all
other calculations can be performed by readers based on type and number of values.
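
As a rough illustration of the reader-side calculation, here is a minimal Python sketch. It assumes the writer records only the total variable-width bytes for BYTE_ARRAY columns and that readers derive everything else from the physical type and the number of non-null values. The function name, parameter names, and the choice to bit-pack BOOLEAN are hypothetical and not part of this PR; whether the recorded byte count includes length prefixes is also left open here.

```python
from typing import Optional

# Plain-encoded widths, in bytes, of Parquet's fixed-width physical types.
FIXED_WIDTH_BYTES = {
    "INT32": 4,
    "INT64": 8,
    "INT96": 12,
    "FLOAT": 4,
    "DOUBLE": 8,
}


def estimate_unencoded_bytes(
    physical_type: str,
    num_values: int,
    variable_width_bytes: Optional[int] = None,
    type_length: Optional[int] = None,
) -> int:
    """Estimate the unencoded size of the non-null values in a column.

    Hypothetical helper: `variable_width_bytes` stands in for the value a
    writer would record in the statistics; it is only needed for BYTE_ARRAY.
    """
    if physical_type == "BYTE_ARRAY":
        # Variable-width data cannot be derived from the value count alone,
        # so the reader relies on the recorded statistic.
        if variable_width_bytes is None:
            raise ValueError("BYTE_ARRAY requires variable_width_bytes")
        return variable_width_bytes
    if physical_type == "FIXED_LEN_BYTE_ARRAY":
        # The width comes from the schema's type_length.
        if type_length is None:
            raise ValueError("FIXED_LEN_BYTE_ARRAY requires type_length")
        return num_values * type_length
    if physical_type == "BOOLEAN":
        # Assumption: booleans are counted as bit-packed, rounded up to bytes.
        return (num_values + 7) // 8
    # Remaining physical types have a fixed plain-encoded width.
    return num_values * FIXED_WIDTH_BYTES[physical_type]
```

For example, `estimate_unencoded_bytes("INT32", 10)` needs nothing from the writer, while `estimate_unencoded_bytes("BYTE_ARRAY", 3, variable_width_bytes=17)` simply returns the recorded value.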