Hi folks,

The recent proposal about disjoint column chunks is great but it made me
wonder how mainstream engines today cope with encountering future features
in files.

I've put together a quick script - nothing stable, portable, or shareable
yet, sorry - to see where we stand:

# Reader Compatibility

![Compatibility Matrix](matrix.png)

The sections below follow the current compatibility matrix, one per test.
Each section lists what the engines named in it do; engines not mentioned
handled the test correctly. Unless noted otherwise, the expected reader
behaviour for every test is: reject reads of the affected column with a
controlled error, while still allowing reads of the unaffected column.

## `ColumnMetaData.data_page_offset = -1` (dict-encoded)

The affected column is dictionary-encoded, but `data_page_offset` is set to
`-1` while `dictionary_page_offset` remains valid. The test is motivated by
the mailing-list discussion of using `-1` as an escape-hatch sentinel for
disjoint column chunks; `-1` is currently invalid, so the question is
whether readers treat it as a signal or silently read through it.

- `arrow-rs` and `parquet-go` use `dictionary_page_offset` as the chunk
  start when a dictionary is present, never consulting `data_page_offset`.
  The mutated `-1` is silently ignored.
- `duckdb` (1.5.2) takes the unsigned min across the chunk's offsets, so
  `-1` silently loses to the dictionary offset. More recently, a
  negative-value guard was added; the next release will flip this cell
  green.
- `bigquery` also reads the column fine, presumably via a similar
  dict-preference mechanism.
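The unsigned-min behaviour above is easy to reproduce in a few lines (a
sketch of the mechanism, not duckdb's actual code):

```python
# Sketch: how an unsigned min across a chunk's offsets silently swallows a
# -1 sentinel whenever a dictionary page offset is present.

U64_MAX = 2**64 - 1

def chunk_start_unsigned_min(data_page_offset, dictionary_page_offset=None):
    """Pick the chunk start as the unsigned minimum of the known offsets."""
    offsets = [data_page_offset]
    if dictionary_page_offset is not None:
        offsets.append(dictionary_page_offset)
    # Reinterpret each signed i64 as u64 before comparing, as an engine
    # working in unsigned offsets effectively does.
    return min(o & U64_MAX for o in offsets)

# With a dictionary present, -1 wraps to 2**64 - 1 and loses the min, so
# the sentinel is silently ignored:
print(chunk_start_unsigned_min(-1, 4))     # 4
# Without a dictionary there is nothing for it to lose to:
print(chunk_start_unsigned_min(-1))        # 18446744073709551615
```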

## `ColumnMetaData.data_page_offset = -1` (no dictionary)

Same mutation, but the column has no dictionary, so there's no positive
offset for a `min`-based or "prefer-dictionary" trick to fall back on.

- `arrow-rs` panics in its chunk-range helper on the assertion that column
  start must not be negative.
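A minimal sketch of the up-front validation that would turn both the
dictionary and no-dictionary mutations into the expected controlled
per-column error (hypothetical code, not any engine's):

```python
# Sketch: validate offsets before any arithmetic or comparison, so a -1
# sentinel fails loudly for the affected column instead of panicking or
# being silently ignored.

class InvalidOffset(ValueError):
    """Raised when a column chunk carries a negative file offset."""

def chunk_start_checked(data_page_offset, dictionary_page_offset=None):
    for name, off in (("data_page_offset", data_page_offset),
                      ("dictionary_page_offset", dictionary_page_offset)):
        if off is not None and off < 0:
            raise InvalidOffset(f"{name} is negative: {off}")
    # When a dictionary exists it precedes the data pages in the chunk.
    if dictionary_page_offset is not None:
        return dictionary_page_offset
    return data_page_offset

print(chunk_start_checked(10, 4))  # 4
# chunk_start_checked(-1) raises InvalidOffset for just this column.
```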

## `PageHeader.compressed_page_size = -1`

The first data page of the affected column has `compressed_page_size = -1`.

- `fastparquet` passes `compressed_page_size` unchecked to `NumpyIO.read()`,
  which on `-1` returns the rest of the buffer. The over-read then flows
  into decompression (or directly into the page parser for uncompressed
  columns) and may surface as either corrupt data or a decompression error,
  depending on the codec and data shape.
- `parquet-go` passes `CompressedPageSize` unvalidated to a slice operation
  `b.data[:size]`; `size = -1` triggers Go's "slice bounds out of range
  [:-1]" panic.
- `arrow-rs` (53.4.1) casts `compressed_page_size: i32` to `usize`,
  sign-extending `-1` to `u64::MAX`, then `Vec::with_capacity(u64::MAX)`
  panics with capacity overflow. Fixed in 54.1.0 by adding a
  `verify_page_size` guard.
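Both failure shapes can be illustrated with the standard library alone (a
sketch of the mechanisms, not the engines' code):

```python
import io

# 1. fastparquet-style over-read: an unchecked read(-1) returns the rest of
#    the buffer rather than one page's worth of bytes.
buf = io.BytesIO(b"page-one-bytes|page-two-bytes")
print(buf.read(-1))  # b'page-one-bytes|page-two-bytes' -- the whole buffer

# 2. arrow-rs-style sign extension: reinterpreting i32 -1 as an unsigned
#    64-bit machine word yields u64::MAX, which then blows up any
#    allocation sized from it.
size_i32 = -1
as_usize = size_i32 & (2**64 - 1)
print(as_usize)  # 18446744073709551615

# The fix in every case is the same: reject compressed_page_size < 0 before
# using it for reads, slices, or allocations.
```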

## `ColumnMetaData.encodings` and `DataPageHeader.encoding`

The affected column advertises an unknown encoding both in the footer and in
the first data page header.

## `PageHeader.type`

The first data page of the affected column has an unknown `PageHeader.type`,
simulating the introduction of a hypothetical `DataPageV3`.

There is a real format-history tension here. The Parquet README says
additional page types can be safely skipped, which is fine for non-data
extensions like `INDEX_PAGE`. But that contract was already strained when
`DATA_PAGE_V2` landed: V2 pages carry data, so a reader that silently
skipped them under the "safe to skip" rule would lose rows. Any future
data-bearing page format inherits the same problem: on an unknown page
type, a reader has to choose between silently skipping (and risking lost
data) or rejecting (and blocking forward-compat reads), and neither is
unambiguously correct from the spec.

- `fastparquet` only special-cases dictionary pages and `DATA_PAGE_V2`; any
  other page type falls through to the V1 path and is decoded as if it were
  a V1 data page.
- `arrow-rs` (53.4.1) panics: `decode_page`'s fallback arm calls
  `unimplemented!()` for unknown page types. More recently (57.1.0) the
  panic was replaced with a mix of a clean error and a "skip unknown page
  and continue" path. The skip path is itself the same silent-data-loss
  problem as `arrow-cpp` below, so the cell stays red.
- `arrow-cpp` (via pyarrow) treats unknown page types as skippable and drops
  the affected data pages. The rows carried by those pages disappear from
  the output, which can come back truncated; a downstream full-table read
  may then fail because column lengths no longer match.
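One way to resolve the tension, sketched below as hypothetical reader code
(not any engine's): skip only page types known to be non-data extensions,
and reject everything else rather than guessing.

```python
# Known PageType ids from parquet.thrift.
DATA_PAGE, INDEX_PAGE, DICTIONARY_PAGE, DATA_PAGE_V2 = 0, 1, 2, 3

# Only types the reader positively knows carry no row data may be skipped.
SKIPPABLE = {INDEX_PAGE}

def classify_page(page_type):
    """Decide whether to decode, skip, or reject a page by its type id."""
    if page_type in (DATA_PAGE, DATA_PAGE_V2, DICTIONARY_PAGE):
        return "decode"
    if page_type in SKIPPABLE:
        return "skip"
    # An unknown type may carry data (as DATA_PAGE_V2 did for pre-V2
    # readers), so skipping risks silent row loss -- fail loudly instead.
    raise ValueError(f"unknown page type {page_type}; refusing to skip")

print(classify_page(DATA_PAGE))   # decode
print(classify_page(INDEX_PAGE))  # skip
# classify_page(4) raises ValueError for the affected column only.
```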

## `ColumnMetaData.codec`

The affected column advertises an unknown compression codec in the footer.

## `SchemaElement.logicalType`

The affected column's `logicalType` carries an unknown union arm. The spec
treats new logical types as forward-compatible: readers should continue to
read the physical values with "loss of semantic meaning" rather than reject
the file.

- `fastparquet` has a general compact-thrift parsing bug for higher field
  ids: long-form field ids are misparsed in a byte-pattern-dependent way.
  This isn't logical-type-specific, but `LogicalType` is one of the most
  likely places for it to manifest because the union keeps gaining arms;
  it already fails on existing arms like `VARIANT` and `GEOMETRY`.
- `arrow-go` panics with "invalid logical type" in `getLogicalType`'s
  default arm. The switch covers everything up to VARIANT but not GEOMETRY
  (id=17) or GEOGRAPHY (id=18), both in the current spec, so the panic
  isn't limited to hypothetical-future cases — any GEOMETRY- or
  GEOGRAPHY-annotated file would crash arrow-go the same way. Still
  missing at HEAD.
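To make the long-form field-id parsing concrete, here is a minimal
compact-protocol sketch (an illustration of the mechanism, not
fastparquet's code). Field headers pack a 1-15 id delta into the high
nibble of the first byte; a zero delta means the id follows as a
zigzag-encoded varint, and mishandling that long form is exactly the
byte-pattern-dependent failure described above.

```python
def read_varint(buf, pos):
    """Read a little-endian base-128 varint; return (value, new_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def zigzag_to_int(n):
    """Undo zigzag encoding: 0,1,2,3,... -> 0,-1,1,-2,..."""
    return (n >> 1) ^ -(n & 1)

def read_field_header(buf, pos, last_field_id):
    first = buf[pos]
    pos += 1
    field_type = first & 0x0F
    delta = first >> 4
    if delta != 0:
        # Short form: id is a 1-15 delta from the previous field id.
        return last_field_id + delta, field_type, pos
    # Long form: id follows as a zigzag varint. This is the branch that
    # higher field ids (like LogicalType's newer arms) exercise.
    raw, pos = read_varint(buf, pos)
    return zigzag_to_int(raw), field_type, pos

# Long-form header for field id 17 (GEOMETRY in the current LogicalType
# union), compact struct type 0x0C: zigzag(17) = 34 = 0x22.
fid, ftype, _ = read_field_header(bytes([0x0C, 0x22]), 0, 0)
print(fid, ftype)  # 17 12
```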

## `SchemaElement.type`

The affected column's physical `type` is set to an unknown enum value.

- `parquet-go` silently substitutes `nullType` for the column. Its
  physical-type dispatch covers BOOLEAN, INT32, INT64, INT96, FLOAT,
  DOUBLE, BYTE_ARRAY, and FIXED_LEN_BYTE_ARRAY; anything else falls
  through to `nullType`. The substitution doesn't check `repetition_type`,
  so a REQUIRED column gets the same treatment even though the schema
  says it can't have nulls. No error surfaces; downstream consumers see
  a column with `Kind() == -1` and zero length.
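The behaviour the test expects instead can be sketched as an explicit
dispatch whose default arm is a controlled error (hypothetical code, not
parquet-go's):

```python
# Physical Type ids from parquet.thrift.
KNOWN_TYPES = {
    0: "BOOLEAN", 1: "INT32", 2: "INT64", 3: "INT96",
    4: "FLOAT", 5: "DOUBLE", 6: "BYTE_ARRAY", 7: "FIXED_LEN_BYTE_ARRAY",
}

def resolve_physical_type(type_id, repetition):
    """Map a physical type id to a name, rejecting unknown values rather
    than silently substituting a null type."""
    try:
        return KNOWN_TYPES[type_id]
    except KeyError:
        # Surfacing repetition in the error makes the REQUIRED-column case
        # visible instead of quietly producing a nullable stand-in.
        raise ValueError(
            f"unknown physical type {type_id} on a {repetition} column"
        ) from None

print(resolve_physical_type(1, "REQUIRED"))  # INT32
# resolve_physical_type(99, "REQUIRED") raises ValueError for this column.
```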

## `FileMetaData.column_orders`

The file footer carries an unknown `column_orders` union arm for the
affected column. This is forward-compat metadata that affects how
`min_value` / `max_value` should be interpreted for stat-based pruning; the
spec says its meaning is "undefined" (not invalid) without a recognised
arm, so readers shouldn't refuse the file.

## `FileMetaData.version`

The `version` field of FileMetaData is set to an unknown value. The spec
itself notes that writers should always populate `1` and readers should
accept `1` and `2` interchangeably; other values are "reserved for future
use-cases" with no prescribed reader behaviour. In practice every engine in
the matrix ignores the field entirely — the field is functionally dormant,
and format extensions are detected by structural metadata rather than a
version bump.


Hope it's useful,
Will
