yugan95 opened a new pull request, #7956:
URL: https://github.com/apache/paimon/pull/7956
### Purpose
Pass through Parquet statistics and page-size-check configuration in `RowDataParquetBuilder`.

Currently, `RowDataParquetBuilder` does not forward the following Parquet config keys to the writer:
- `parquet.statistics.truncate.length`
- `parquet.columnindex.truncate.length`
- `parquet.page.size.row.check.min`
- `parquet.page.size.row.check.max`

Without these, users cannot tune Parquet page-size checking behavior or control the truncation length of statistics and column indexes. This is especially relevant for tables with large records, where the default page-size check thresholds can lead to oversized pages.

Split out from #7621 per reviewer feedback, as this is an independent enhancement.
#### Changes
- **`RowDataParquetBuilder`**: add `.withMinRowCountForPageSizeCheck()`, `.withMaxRowCountForPageSizeCheck()`, `.withStatisticsTruncateLength()`, and `.withColumnIndexTruncateLength()` to the builder chain. Each call reads its value from the existing Hadoop `Configuration` and falls back to Parquet's default when the key is not set.
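
The lookup-with-fallback pattern described above can be sketched in isolation. This is a minimal, self-contained illustration and not the code in this PR: `ParquetTuningSketch` and its `getInt` helper are hypothetical, a plain `Map` stands in for the Hadoop `Configuration`, and the default constants are assumptions meant to mirror Parquet's `ParquetProperties` defaults, so verify them against the Parquet version in use.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the config resolution done before the builder chain.
public class ParquetTuningSketch {
    // Assumed defaults (believed to match ParquetProperties; check your Parquet version).
    public static final int DEFAULT_MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK = 100;
    public static final int DEFAULT_MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK = 10_000;
    public static final int DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH = 64;
    public static final int DEFAULT_STATISTICS_TRUNCATE_LENGTH = Integer.MAX_VALUE;

    // Returns the integer stored under key, or fallback when the key is unset.
    public static int getInt(Map<String, String> conf, String key, int fallback) {
        String raw = conf.get(key);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        // A Map stands in for the Hadoop Configuration; only one key is overridden.
        Map<String, String> conf = new HashMap<>();
        conf.put("parquet.page.size.row.check.min", "10");

        int minRows = getInt(conf, "parquet.page.size.row.check.min",
                DEFAULT_MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK);
        int maxRows = getInt(conf, "parquet.page.size.row.check.max",
                DEFAULT_MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK);
        int columnIndexLen = getInt(conf, "parquet.columnindex.truncate.length",
                DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH);
        int statisticsLen = getInt(conf, "parquet.statistics.truncate.length",
                DEFAULT_STATISTICS_TRUNCATE_LENGTH);

        // In the real builder these values would feed the corresponding
        // withMinRowCountForPageSizeCheck(...) etc. calls on the Parquet writer builder.
        System.out.println(minRows + " " + maxRows + " " + columnIndexLen + " " + statisticsLen);
    }
}
```

With only `parquet.page.size.row.check.min` set, the other three values resolve to their fallbacks, which is the behavior the default-path tests rely on.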
### Tests
Existing Parquet write tests cover the default config path. The new keys follow the same pattern as other builder options (e.g. `withDictionaryPageSize`, `withPageSize`) and use Parquet's built-in defaults when not set.
### API and Format
N/A: no public API or format changes. Users can now set these keys via table properties, using the same mechanism as other Parquet config keys.
### Documentation
N/A