Hi folks,

The recent proposal about disjoint column chunks is great, but it made me wonder how mainstream engines today cope with encountering future features in files. I've made a quick script - nothing stable or portable or shareable, sorry - to see where we are:

[image: image.png]
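The harness is roughly this shape, if anyone wants to reproduce it (file and column names here are illustrative, and the thrift round-trip that actually flips the field is elided):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small valid two-column file: one column gets mutated, the other
# stays intact so "unaffected column still readable" can be checked.
pq.write_table(
    pa.table({"affected": [1, 2, 3], "unaffected": ["a", "b", "c"]}),
    "victim.parquet",
)

with open("victim.parquet", "rb") as f:
    data = bytearray(f.read())

# A parquet file ends with the serialized thrift FileMetaData, a 4-byte
# little-endian footer length, and the "PAR1" magic.
assert data[-4:] == b"PAR1"
footer_len = int.from_bytes(data[-8:-4], "little")
footer = data[-8 - footer_len:-8]
# ... deserialize `footer`, set e.g. data_page_offset = -1 on the affected
# column's chunk, re-serialize, patch the length, write the mutated file,
# then point each engine at it ...
```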
# Reader Compatibility

Tests from the current compatibility matrix, grouped by test. Each section covers what the engines listed there do; engines not mentioned handled the test correctly. The expected reader behaviour for every test (unless noted otherwise) is: reject reads of the affected column with a controlled error, while still allowing reads of the unaffected column.

## `ColumnMetaData.data_page_offset = -1` (dict-encoded)

The affected column is dictionary-encoded, but `data_page_offset` is set to `-1` while `dictionary_page_offset` remains valid. The test is motivated by the mailing-list discussion of using `-1` as an escape-hatch sentinel for disjoint column chunks; `-1` is currently invalid, so the question is whether readers treat it as a signal or silently read through it.

- `arrow-rs` and `parquet-go` use `dictionary_page_offset` as the chunk start when a dictionary is present, never consulting `data_page_offset`. The mutated `-1` is silently ignored.
- `duckdb` (1.5.2) takes the unsigned minimum across the chunk's offsets, so `-1` silently loses to the dictionary offset. More recently a negative-value guard was added; the next release will flip this cell green.
- `bigquery` also reads the column fine, presumably via a similar dict-preference mechanism.

## `ColumnMetaData.data_page_offset = -1` (no dictionary)

Same mutation, but the column has no dictionary, so there's no positive offset for a `min`-based or "prefer-dictionary" trick to fall back on.

- `arrow-rs` panics in its chunk-range helper on the assertion that the column start must not be negative.

## `PageHeader.compressed_page_size = -1`

The first data page of the affected column has `compressed_page_size = -1`.

- `fastparquet` passes `compressed_page_size` unchecked to `NumpyIO.read()`, which on `-1` returns the rest of the buffer. The over-read then flows into decompression (or directly into the page parser for uncompressed columns) and may surface as either corrupt data or a decompression error, depending on the codec and data shape.
- `parquet-go` passes `CompressedPageSize` unvalidated to a slice operation `b.data[:size]`; `size = -1` triggers Go's "slice bounds out of range [:-1]" panic.
- `arrow-rs` (53.4.1) casts `compressed_page_size: i32` to `usize`, sign-extending `-1` to `u64::MAX`, so `Vec::with_capacity(u64::MAX)` panics with a capacity overflow. Fixed in 54.1.0 by adding a `verify_page_size` guard. The reinterpretation is sketched just after this list.
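All three failures come from trusting the declared size unvalidated. A minimal Python illustration of the sign-extension in the arrow-rs case, followed by the shape of the guard that fixes all three (illustrative names, not any engine's actual API):

```python
# compressed_page_size arrives as a thrift i32. Reinterpreting -1 as an
# unsigned 64-bit length (what a Rust `as usize` cast does on a 64-bit
# target) yields u64::MAX, so the allocation fails before any bytes move.
compressed_page_size = -1
as_usize = compressed_page_size % (1 << 64)  # two's-complement reinterpretation
print(as_usize)  # 18446744073709551615 == u64::MAX

# The fix has the same shape everywhere: bounds-check the declared size
# against what is actually left in the chunk before using it as a length.
def checked_page_size(declared: int, remaining: int) -> int:
    if declared < 0 or declared > remaining:
        raise ValueError(f"invalid compressed_page_size: {declared}")
    return declared
```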
## `ColumnMetaData.encodings` and `DataPageHeader.encoding`

The affected column advertises an unknown encoding both in the footer and in the first data page header.

## `PageHeader.type`

The first data page of the affected column has an unknown `PageHeader.type`, simulating the introduction of a hypothetical `DataPageV3`. There is a real format-history tension here. The parquet README says additional page types can be safely skipped, which is fine for non-data extensions like `INDEX_PAGE`. But that contract was already strained when `DATA_PAGE_V2` landed: V2 pages carry data, so a reader that silently skipped them under the "safe to skip" rule would lose rows. Any future data-bearing page format inherits the same problem: on an unknown page type, a reader has to choose between silently skipping (risking lost data) and rejecting (blocking forward-compat reads), and neither is unambiguously correct from the spec. Both options are sketched after the list below.

- `fastparquet` only special-cases dictionary pages and `DATA_PAGE_V2`; any other page type falls through to the V1 path and is decoded as if it were a V1 data page.
- `arrow-rs` (53.4.1) panics: `decode_page`'s fallback arm calls `unimplemented!()` for unknown page types. More recently (57.1.0) the panic was replaced with a mix of a clean error and a "skip unknown page and continue" path. The skip path has the same silent-data-loss problem as `arrow-cpp` below, so the cell stays red.
- `arrow-cpp` (via pyarrow) treats unknown page types as skippable and drops the affected data pages. The rows carried by those pages disappear from the output, which can come back truncated; a downstream full-table read may then fail because column lengths no longer match.
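To make the skip-versus-reject choice concrete, here it is boiled down to a toy loop (pure illustration, not any engine's code; the enum values other than the hypothetical `4` are the real ones from parquet.thrift):

```python
from dataclasses import dataclass

# PageType values from parquet.thrift; a hypothetical DataPageV3 would show
# up as an enum value outside this set (4 is used below).
DATA_PAGE, INDEX_PAGE, DICTIONARY_PAGE, DATA_PAGE_V2 = 0, 1, 2, 3
KNOWN_PAGE_TYPES = {DATA_PAGE, INDEX_PAGE, DICTIONARY_PAGE, DATA_PAGE_V2}

@dataclass
class PageHeader:
    type: int
    num_values: int

def read_column(pages, strategy):
    total = 0
    for page in pages:
        if page.type in KNOWN_PAGE_TYPES:
            total += page.num_values  # stand-in for real page decoding
        elif strategy == "skip":
            continue  # arrow-cpp style: rows silently vanish if the page carried data
        else:
            raise ValueError(f"unknown page type {page.type}")  # safe, but also blocks skippable extensions
    return total

pages = [PageHeader(DATA_PAGE, 100), PageHeader(4, 100)]  # 4 = hypothetical DataPageV3
print(read_column(pages, "skip"))    # 100: half the rows are gone, with no error
print(read_column(pages, "reject"))  # raises ValueError
```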
## `ColumnMetaData.codec`

The affected column advertises an unknown compression codec in the footer.

## `SchemaElement.logicalType`

The affected column's `logicalType` carries an unknown union arm. The spec treats new logical types as forward-compatible: readers should continue to read the physical values with "loss of semantic meaning" rather than reject the file.

- `fastparquet` has a general compact-thrift parsing bug for higher field ids: long-form field ids are misparsed in a byte-pattern-dependent way. This isn't logical-type-specific, but `LogicalType` is one of the most likely places for it to manifest because the union keeps gaining arms. It already fails on existing arms like `VARIANT` and `GEOMETRY`.
- `arrow-go` panics with "invalid logical type" in `getLogicalType`'s default arm. The switch covers everything up to VARIANT but not GEOMETRY (id=17) or GEOGRAPHY (id=18), both in the current spec, so the panic isn't limited to hypothetical-future cases: any GEOMETRY- or GEOGRAPHY-annotated file crashes arrow-go the same way. Still missing at HEAD.

## `SchemaElement.type`

The affected column's physical `type` is set to an unknown enum value.

- `parquet-go` silently substitutes `nullType` for the column. Its physical-type dispatch covers BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BYTE_ARRAY, and FIXED_LEN_BYTE_ARRAY; anything else falls through to `nullType`. The substitution doesn't check `repetition_type`, so a REQUIRED column gets the same treatment even though the schema says it can't contain nulls. No error surfaces; downstream consumers see a column with `Kind() == -1` and zero length.

## `FileMetaData.column_orders`

The file footer carries an unknown `column_orders` union arm for the affected column. This is forward-compat metadata that affects how `min_value` / `max_value` should be interpreted for stat-based pruning; the spec says its meaning is "undefined" (not invalid) without a recognised arm, so readers shouldn't refuse the file.

## `FileMetaData.version`

The `version` field of FileMetaData is set to an unknown value. The spec itself notes that writers should always populate `1` and readers should accept `1` and `2` interchangeably; other values are "reserved for future use-cases" with no prescribed reader behaviour. In practice every engine in the matrix ignores the field entirely: it is functionally dormant, and format extensions are detected by structural metadata rather than a version bump.

Hope it's useful,
Will