wgtmac commented on code in PR #3556:
URL: https://github.com/apache/parquet-java/pull/3556#discussion_r3394400538
##########
parquet-column/src/main/java/org/apache/parquet/column/values/fallback/FallbackValuesWriter.java:
##########
@@ -79,8 +91,13 @@ public long getBufferedSize() {
@Override
public BytesInput getBytes() {
- if (!fellBackAlready && firstPage) {
- // we use the first page to decide if we're going to use this encoding
+ try {
+ cumulativeRawBytes = Math.addExact(cumulativeRawBytes, rawDataByteSize);
+ } catch (ArithmeticException e) {
+ // overflow, keep the previous value
+ }
+ if (!fellBackAlready && !compressionChecked && cumulativeRawBytes >=
dictionaryCheckThresholdRawSizeBytes) {
+ compressionChecked = true;
BytesInput bytes = initialWriter.getBytes();
if (!initialWriter.isCompressionSatisfying(rawDataByteSize,
bytes.size())) {
Review Comment:
I think the threshold gate now accumulates raw bytes across pages, but the
compression check still compares the current page raw size against the current
page encoded size plus the whole accumulated dictionary size. This seems biased
toward fallback once the check is delayed beyond the first page. Should this
comparison use cumulative raw/encoded sizes, or otherwise avoid charging the
full column-chunk dictionary size to a single page? A finite-threshold test
with repeated/moderate-cardinality values across pages would help verify the
intended PARQUET-3479 case.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]