During data ingestion or upsert operations, documents are flushed to disk in batches, creating disk components. In the columnar storage format, each MegaPage, which logically represents a leaf, begins with a single-page metadata section called `PageZero`. Currently, `PageZero` stores metadata for every column in the global schema, even if a column is not present in the documents of the current batch. This metadata includes a 4-byte offset and a 16-byte filter (min/max values) per column. This approach leads to significant overhead, especially for datasets with sparse or wide schemas. The 128KB default size limit of `PageZero` imposes a practical maximum of approximately 6,500 columns, which is further reduced in practice by the space required for primary keys.

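As a back-of-the-envelope check of that limit, the following sketch computes the estimate from the per-column metadata sizes above; it deliberately ignores fixed headers and the space reserved for primary keys:

```java
public class PageZeroLimitEstimate {
    public static void main(String[] args) {
        // Assumes only the per-column metadata described above:
        // a 4-byte offset plus a 16-byte min/max filter per column.
        final int pageZeroBytes = 128 * 1024;  // default PageZero size
        final int bytesPerColumn = 4 + 16;     // offset + filter
        System.out.println(pageZeroBytes / bytesPerColumn); // 6553, i.e. the "~6,500 columns" cited above
    }
}
```
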
efficient "Sparse PageZero writer". This writer's design is to only store
metadata for the subset of columns that are actually present in the current
batch of documents being flushed, plus any others required for correct
column assembly (e.g., in union types or nested structures). This reduces
metadata overhead, enabling support for schemas with a larger number of
sparse columns within the existing `PageZero` size constraint. Risks and
Risks and trade-offs include a potential performance impact. The sparse format requires PageReaders to perform a binary search to look up column offsets and filters, rather than a direct index lookup, which introduces CPU overhead. There is also a minor computational overhead from the column estimation logic.

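To make the indirect lookup concrete, here is a minimal sketch of a binary search over sparse `PageZero` entries sorted by column index; the class and field names are illustrative assumptions, not the actual AsterixDB reader code:

```java
import java.util.Arrays;

// Hypothetical in-memory view of a sparse PageZero, with per-column
// entries sorted by their global column index.
final class SparsePageZeroView {
    private final int[] columnIndexes; // sorted global indexes of the columns present in this MegaPage
    private final int[] columnOffsets; // offset of each present column's data
    // (filters omitted for brevity; they would be located the same way)

    SparsePageZeroView(int[] columnIndexes, int[] columnOffsets) {
        this.columnIndexes = columnIndexes;
        this.columnOffsets = columnOffsets;
    }

    /** Returns the column's offset, or -1 if the column is absent from this component. */
    int findColumnOffset(int globalColumnIndex) {
        int pos = Arrays.binarySearch(columnIndexes, globalColumnIndex); // O(log n) instead of O(1) direct indexing
        return pos >= 0 ? columnOffsets[pos] : -1;
    }
}
```
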
An alternative is the existing "Default" writer. The proposal also includes an "Adaptive" mode that dynamically evaluates both the Default and Sparse writers for an incoming batch and selects the one that consumes the least space.

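The space comparison behind the Adaptive mode can be sketched as follows; the per-entry byte widths (and the assumption that the sparse layout adds a 4-byte column index per entry) are illustrative, not the actual on-disk accounting:

```java
// Hypothetical size comparison for "Adaptive" mode: estimate both layouts
// for the incoming batch and pick the writer that yields the smaller PageZero.
final class AdaptiveWriterChooser {
    // Assumed entry widths: default = 4B offset + 16B filter per schema column;
    // sparse = 4B columnIndex + 4B offset + 16B filter per *present* column.
    private static final int DEFAULT_BYTES_PER_COLUMN = 4 + 16;
    private static final int SPARSE_BYTES_PER_COLUMN = 4 + 4 + 16;

    enum Mode { DEFAULT, SPARSE }

    static Mode choose(int totalSchemaColumns, int presentColumns) {
        long defaultSize = (long) totalSchemaColumns * DEFAULT_BYTES_PER_COLUMN;
        long sparseSize = (long) presentColumns * SPARSE_BYTES_PER_COLUMN;
        return sparseSize < defaultSize ? Mode.SPARSE : Mode.DEFAULT;
    }
}
```

Under these assumptions the sparse layout wins whenever fewer than roughly 20/24 ≈ 83% of the schema's columns appear in the batch, which is why it pays off for sparse or wide schemas.
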
A limitation of this proposal is that the `PageZero` size remains constrained to one page (128KB by default), so the hard limit on the number of columns in a single MegaPage remains ~6,500 by default; this change does not remove it.

This APE <https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=ASTERIXDB&title=APE+22%3A+Sparse+column+metadata+storage> would introduce a new "Sparse PageZero writer" that writes metadata for only the subset of columns present in a given batch.

The source code changes are summarized as follows:

* A new `PageZero Writer Mode` configuration option will be added with three possible values:
  * "*Default*": Always uses the current writer.
  * "*Sparse*": Always uses the new sparse writer.
  * "*Adaptive*": Dynamically compares the space usage of both writers for an incoming batch and selects the one that results in a smaller `PageZero`.
* The sparse layout will store `columnIndex`, `columnOffset`, and `columnFilter` for each present column (a sketch of such an entry follows this list).
* Logic will be added to determine the minimum required set of columns for a batch, accounting for schema evolution, unions, and nested structures to ensure correct record assembly.

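A minimal sketch of what one per-column entry in the sparse layout could contain, based on the fields named above; the byte widths, field ordering, and class name are assumptions rather than the actual on-disk format:

```java
import java.nio.ByteBuffer;

// Hypothetical per-column entry of the sparse PageZero layout. An entry is
// written only for columns present in the flushed batch (plus any columns
// required for correct assembly), and entries are assumed to be sorted by
// columnIndex so readers can binary-search them.
final class SparseColumnEntry {
    final int columnIndex;   // global index of the column in the schema
    final int columnOffset;  // where this column's data starts in the MegaPage
    final long filterMin;    // min/max filter, assumed here to be two 8-byte values
    final long filterMax;

    SparseColumnEntry(int columnIndex, int columnOffset, long filterMin, long filterMax) {
        this.columnIndex = columnIndex;
        this.columnOffset = columnOffset;
        this.filterMin = filterMin;
        this.filterMax = filterMax;
    }

    void writeTo(ByteBuffer pageZero) {
        pageZero.putInt(columnIndex);
        pageZero.putInt(columnOffset);
        pageZero.putLong(filterMin);
        pageZero.putLong(filterMax);
    }
}
```
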
The change is controlled by a new configuration option. Existing disk components created with the default writer will coexist with new components. Since the global metadata is maintained at the index level and used by the column assembler to reconstruct records, the system will be able to read from components created with either writer, ensuring backward compatibility.

The following areas will be tested to validate the change:

*Performance Testing:* Once a prototype is available, performance testing should be done to evaluate the trade-offs:

1. *Indirect Column Lookup*: Measure the CPU overhead introduced by using binary search to locate column offsets and filters (see the measurement sketch after this list).
2. *Column Estimation Overhead*: Measure the computational cost of the column estimation step.

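As a rough illustration of the first measurement, the sketch below times direct indexing against binary search over a simulated offset table; it is a quick-and-dirty harness with arbitrary sizes, not a proper benchmark (e.g., JMH):

```java
import java.util.Arrays;
import java.util.Random;

// Crude comparison of direct indexing (default layout) vs. binary search
// (sparse layout) for locating a column's offset.
public class ColumnLookupTiming {
    public static void main(String[] args) {
        final int totalColumns = 6_000;
        int[] offsets = new int[totalColumns];        // dense layout: offsets[i] belongs to column i
        int[] presentIndexes = new int[totalColumns]; // sparse layout: sorted present column indexes
        for (int i = 0; i < totalColumns; i++) {
            offsets[i] = i * 100;
            presentIndexes[i] = i;
        }
        Random rnd = new Random(42);
        long direct = 0, binary = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < 10_000_000; i++) {
            direct += offsets[rnd.nextInt(totalColumns)];                             // O(1)
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < 10_000_000; i++) {
            binary += Arrays.binarySearch(presentIndexes, rnd.nextInt(totalColumns)); // O(log n)
        }
        long t2 = System.nanoTime();
        System.out.printf("direct: %d ms, binary search: %d ms (checksums: %d, %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, direct, binary);
    }
}
```
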
*Functional Testing:*

1. *Default Writer Validation*: Run the existing test suite with documents containing most or all fields to ensure the default writer's behavior is unchanged.
2. *Sparse Writer Validation*: Design a new test suite with batches of sparse documents (where each batch contains a subset of fields) to verify that the `SparsePageZeroWriter` produces smaller disk components. Tests will be constructed with a column set less than or equal to the 6,500 column limit (a sketch of such batches follows this list).
3. *Correctness Checks*: For both writers, compare query results with row format collections to ensure correctness, paying special attention to missing fields, null values, and nested structures (arrays, objects, unions).

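For item 2, one way the sparse batches might be generated (the field-naming scheme and batch shape here are illustrative assumptions, not part of the proposal):

```java
import java.util.ArrayList;
import java.util.List;

// Generates batches of JSON documents where each batch populates a different,
// small slice of a wide set of optional fields, so every flush exercises the
// sparse PageZero path with a distinct subset of columns.
public class SparseBatchGenerator {
    static List<String> batch(int batchId, int docsPerBatch, int fieldsPerBatch) {
        List<String> docs = new ArrayList<>();
        int firstField = batchId * fieldsPerBatch; // each batch touches its own field range
        for (int d = 0; d < docsPerBatch; d++) {
            StringBuilder doc = new StringBuilder("{\"id\": " + (batchId * docsPerBatch + d));
            for (int f = firstField; f < firstField + fieldsPerBatch; f++) {
                doc.append(", \"field_").append(f).append("\": ").append(d);
            }
            docs.add(doc.append("}").toString());
        }
        return docs;
    }

    public static void main(String[] args) {
        // e.g., 10 batches of 2 documents, each batch carrying 5 of the optional fields
        for (int b = 0; b < 10; b++) {
            System.out.println(batch(b, 2, 5));
        }
    }
}
```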