wgtmac commented on code in PR #3556:
URL: https://github.com/apache/parquet-java/pull/3556#discussion_r3360643554
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
##########
@@ -161,6 +161,7 @@ public static enum JobSummaryLevel {
  public static final String BLOCK_ROW_COUNT_LIMIT = "parquet.block.row.count.limit";
  public static final String PAGE_ROW_COUNT_LIMIT = "parquet.page.row.count.limit";
  public static final String PAGE_WRITE_CHECKSUM_ENABLED = "parquet.page.write-checksum.enabled";
+ public static final String DICTIONARY_CHECK_AFTER_BYTES = "parquet.dictionary.check.after.raw.bytes";
Review Comment:
Could we also document `parquet.dictionary.check.after.raw.bytes` in the
configuration list above? It would be useful to mention that this is based on
raw value bytes, and nulls encoded in definition levels do not contribute to
this threshold.
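
To illustrate the semantics worth documenting, here is a toy model in plain Java. It is not the writer code from this PR: the class `RawBytesThresholdSketch`, its methods, and the `>=` comparison are assumptions made for illustration only. It shows a counter that grows only with raw value bytes, so nulls (which would live in definition levels) never move it toward the threshold.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model; names and the ">=" comparison are assumptions,
// not taken from the PR.
final class RawBytesThresholdSketch {
  private final long checkAfterRawBytes;
  private long rawBytesSeen = 0L;

  RawBytesThresholdSketch(long checkAfterRawBytes) {
    this.checkAfterRawBytes = checkAfterRawBytes;
  }

  // A null models a value recorded only in definition levels:
  // it adds nothing to the raw byte count.
  void write(String value) {
    if (value != null) {
      rawBytesSeen += value.getBytes(StandardCharsets.UTF_8).length;
    }
  }

  long rawBytesSeen() {
    return rawBytesSeen;
  }

  // With a threshold of 0 this is true from the start, which matches
  // "check on the first page".
  boolean shouldCheckDictionary() {
    return rawBytesSeen >= checkAfterRawBytes;
  }

  public static void main(String[] args) {
    RawBytesThresholdSketch tracker = new RawBytesThresholdSketch(8L);
    List<String> column = Arrays.asList(null, null, "abcd", null, "efgh");
    for (String v : column) {
      tracker.write(v);
      System.out.println(v + " -> rawBytes=" + tracker.rawBytesSeen()
          + ", check=" + tracker.shouldCheckDictionary());
    }
  }
}
```

In this model a column that is mostly null can span many pages before the check fires, which is the behavior the documentation entry should make explicit.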
##########
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##########
@@ -709,6 +725,19 @@ public Builder withPageWriteChecksumEnabled(boolean val) {
return this;
}
+ /**
+ * Set the raw data byte threshold after which the dictionary compression check is performed.
+ * A value of 0 means check on the first page (backward compatible default). Higher values
+ * delay the check until that many raw bytes have been accumulated across pages.
+ *
+ * @param val byte threshold (default: 0)
+ * @return this builder for method chaining
+ */
+ public Builder withDictionaryCheckAfterBytes(long val) {
+ this.dictionaryCheckAfterBytes = val;
Review Comment:
Should we reject negative values here? A negative threshold effectively
behaves like `0`, but accepting it silently seems a bit confusing for a
size-like config. Most nearby size/count options validate their input, so
requiring `val >= 0` would be clearer.
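
A minimal sketch of the suggested check, written as a standalone class so it compiles on its own. The class `DictionaryCheckBuilderSketch`, the getter, and the exception message are hypothetical; only `withDictionaryCheckAfterBytes` and the `dictionaryCheckAfterBytes` field come from the diff above. In the actual builder, the check would presumably use whatever precondition helper the neighboring setters already use.

```java
// Hypothetical stand-in for the relevant part of ParquetProperties.Builder.
final class DictionaryCheckBuilderSketch {
  private long dictionaryCheckAfterBytes = 0L; // 0 = check on the first page

  DictionaryCheckBuilderSketch withDictionaryCheckAfterBytes(long val) {
    // Reject negative thresholds instead of silently treating them like 0.
    if (val < 0) {
      throw new IllegalArgumentException(
          "Invalid dictionary check threshold (must be >= 0): " + val);
    }
    this.dictionaryCheckAfterBytes = val;
    return this;
  }

  long getDictionaryCheckAfterBytes() {
    return dictionaryCheckAfterBytes;
  }

  public static void main(String[] args) {
    DictionaryCheckBuilderSketch builder =
        new DictionaryCheckBuilderSketch().withDictionaryCheckAfterBytes(1024L);
    System.out.println("threshold = " + builder.getDictionaryCheckAfterBytes());

    try {
      builder.withDictionaryCheckAfterBytes(-1L);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Failing fast here keeps a typo such as a stray minus sign from silently falling back to first-page behavior.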
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]