wgtmac commented on code in PR #3556:
URL: https://github.com/apache/parquet-java/pull/3556#discussion_r3360641976
##########
parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java:
##########
@@ -79,8 +92,8 @@ public long getBufferedSize() {
@Override
public BytesInput getBytes() {
- if (!fellBackAlready && firstPage) {
- // we use the first page to decide if we're going to use this encoding
+ if (!fellBackAlready && !compressionChecked && cumulativeRawBytes >=
checkAfterBytes) {
Review Comment:
Could we add a test where the threshold is crossed only after a `reset()`?
This is the exact case we want to protect here, since `rawDataByteSize` is
reset per page while `cumulativeRawBytes` should keep accumulating across
pages. The current tests with `0` and `Long.MAX_VALUE` would not catch a
regression back to per-page counting.
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java:
##########
@@ -771,6 +771,17 @@ public SELF withPageWriteChecksumEnabled(boolean
enablePageWriteChecksum) {
return self();
}
+ /**
+ * Set the raw data byte threshold after which the dictionary compression
check is performed.
+ *
+ * @param val byte threshold (0 means check on every page)
Review Comment:
Nit: this says `0 means check on every page`, but the implementation only
checks once per column chunk, same as the old first-page behavior. Could we
update this to say `0` means check on the first page / preserve the previous
behavior?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]