During data ingestion or upsert operations, documents are flushed to disk in batches, creating disk components. In the columnar storage format, each MegaPage, which logically represents a leaf, begins with a single-page metadata section called `PageZero`.
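As a concrete illustration of the size limit described below: with a fixed 4-byte offset and 16-byte filter reserved per column, the ~6,500-column ceiling falls out of simple arithmetic on the 128KB `PageZero` budget. A minimal sketch (constants taken from this proposal; header and primary-key overhead are ignored, so the real ceiling is somewhat lower):

```java
/** Back-of-the-envelope column ceiling for the dense PageZero layout. */
public final class PageZeroCapacity {
    public static void main(String[] args) {
        final int pageZeroBytes = 128 * 1024; // default PageZero size (128KB)
        final int offsetBytes = 4;            // per-column offset slot
        final int filterBytes = 16;           // per-column min/max filter
        // Every column in the global schema pays 20 bytes, present or not.
        int maxColumns = pageZeroBytes / (offsetBytes + filterBytes);
        System.out.println(maxColumns);       // prints 6553, i.e. ~6,500
    }
}
```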
Currently, `PageZero` stores metadata for every column in the global schema, even if a column is not present in the documents of the current batch. This metadata includes a 4-byte offset and a 16-byte filter (min/max values) per column. This approach leads to significant overhead, especially for datasets with sparse or wide schemas. The 128KB default size of `PageZero` imposes a practical maximum of approximately 6,500 columns, which is further reduced in practice by the space required for primary keys.

The proposed enhancement introduces an efficient "Sparse PageZero writer" that stores metadata only for the subset of columns actually present in the current batch of documents being flushed, plus any others required for correct column assembly (e.g., in union types or nested structures). This reduces metadata overhead, enabling support for schemas with a larger number of sparse columns within the existing `PageZero` size constraint.

Risks and trade-offs include a potential performance impact. The sparse format requires PageReaders to perform a binary search to look up column offsets and filters, rather than a direct index lookup, which introduces CPU overhead (see the lookup sketch below). There is also a minor computational overhead from the column estimation logic.

An alternative is the existing "Default" writer. The proposal also includes an "Adaptive" mode that dynamically evaluates both the Default and Sparse writers for an incoming batch and selects the one that consumes the least space.

A limitation of this proposal is that `PageZero` remains constrained to a single page (128KB by default), so the hard limit on the number of columns in a single MegaPage stays at roughly 6,500 by default; this change does not remove that limit.

This APE [https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=ASTERIXDB&title=APE+22%3A+Sparse+column+metadata+storage] would introduce a new "Sparse PageZero writer" that writes metadata for only the subset of columns present in a given batch. The source code changes are summarized as follows:

* A new `PageZero Writer Mode` configuration option will be added with three possible values:
  * "*Default*": Always uses the current writer.
  * "*Sparse*": Always uses the new sparse writer.
  * "*Adaptive*": Dynamically compares the space usage of both writers for an incoming batch and selects the one that results in a smaller `PageZero` (a selection sketch follows below).
* The sparse layout will store `columnIndex`, `columnOffset`, and `columnFilter` for each present column.
* Logic will be added to determine the minimum required set of columns for a batch, accounting for schema evolution, unions, and nested structures to ensure correct record assembly.

The change is controlled by the new configuration option, and existing disk components created with the default writer will coexist with new components. Since the global metadata is maintained at the index level and is used by the column assembler to reconstruct records, the system can read from components created with either writer, ensuring backward compatibility.
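To make the lookup trade-off concrete, here is a minimal sketch of how a reader might resolve a column's offset under the sparse layout. The class and field names are hypothetical, not the actual AsterixDB API; the sketch assumes entries are stored sorted by column index, which is what makes binary search applicable:

```java
import java.util.Arrays;

/**
 * Illustrative sketch (not the actual AsterixDB classes): resolving a
 * column's offset under the sparse PageZero layout. Entries are assumed
 * to be sorted by column index, enabling binary search.
 */
public final class SparsePageZeroLookup {
    private final int[] columnIndexes;  // sorted indexes of present columns
    private final int[] columnOffsets;  // parallel array of 4-byte offsets

    public SparsePageZeroLookup(int[] columnIndexes, int[] columnOffsets) {
        this.columnIndexes = columnIndexes;
        this.columnOffsets = columnOffsets;
    }

    /**
     * Returns the stored offset for the requested column, or -1 if the
     * column is absent from this batch. This O(log n) search replaces the
     * dense layout's O(1) array indexing, which is the CPU cost the
     * performance tests below aim to quantify.
     */
    public int findColumnOffset(int columnIndex) {
        int pos = Arrays.binarySearch(columnIndexes, columnIndex);
        return pos >= 0 ? columnOffsets[pos] : -1;
    }
}
```

A filter lookup would follow the same pattern against a parallel array of min/max filters; a dense reader, by contrast, would index `columnOffsets[columnIndex]` directly.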
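Likewise, a minimal sketch of the "Adaptive" mode's space comparison, under the assumption that the dense layout pays 20 bytes (offset + filter) for every schema column while the sparse layout pays that plus a column index for each present column; the 4-byte index width and all names here are illustrative:

```java
/**
 * Illustrative sketch of the "Adaptive" mode: estimate both layouts'
 * PageZero footprint for an incoming batch and pick the smaller one.
 * Per-column sizes follow the proposal (4-byte offset, 16-byte filter);
 * the 4-byte columnIndex width in the sparse layout is an assumption.
 */
public final class AdaptivePageZeroChoice {
    private static final int OFFSET_BYTES = 4;
    private static final int FILTER_BYTES = 16;
    private static final int COLUMN_INDEX_BYTES = 4; // assumed width

    enum WriterMode { DEFAULT, SPARSE }

    static WriterMode choose(int schemaColumnCount, int presentColumnCount) {
        // Dense: every column in the global schema pays offset + filter.
        long dense = (long) schemaColumnCount * (OFFSET_BYTES + FILTER_BYTES);
        // Sparse: only present (and assembly-required) columns pay, but
        // each entry additionally records which column it belongs to.
        long sparse = (long) presentColumnCount
                * (COLUMN_INDEX_BYTES + OFFSET_BYTES + FILTER_BYTES);
        return sparse < dense ? WriterMode.SPARSE : WriterMode.DEFAULT;
    }
}
```

Under these assumptions, the sparse writer wins whenever fewer than about five-sixths (20/24) of the schema's columns appear in a batch.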
The following areas will be tested to validate the change.

*Performance Testing*: Once a prototype is available, performance testing should be done to evaluate the trade-offs:

1. *Indirect Column Lookup*: Measure the CPU overhead introduced by using binary search to locate column offsets and filters.
2. *Column Estimation Overhead*: Measure the computational cost of the column estimation step.

*Functional Testing*:

1. *Default Writer Validation*: Run the existing test suite with documents containing most or all fields to ensure the default writer's behavior is unchanged.
2. *Sparse Writer Validation*: Design a new test suite with batches of sparse documents (where each batch contains a subset of fields) to verify that the `SparsePageZeroWriter` produces smaller disk components. Tests will use column sets at or below the ~6,500-column limit.
3. *Correctness Checks*: For both writers, compare query results against row-format collections to ensure correctness, paying special attention to missing fields, null values, and nested structures (arrays, objects, unions).
