+1, this APE is really cool and is a great solution to tricky situations like objects with generated field names.
On Thu, Jun 12, 2025 at 2:26 AM Ritik Raj <ri...@apache.org> wrote:
>
> During data ingestion or upsert operations, documents are flushed to disk in batches, creating disk components. In the columnar storage format, each MegaPage, which logically represents a leaf, begins with a single page-metadata section called `PageZero`.
>
> Currently, `PageZero` stores metadata for every column in the global schema, even if a column is not present in the documents of the current batch. This metadata includes a 4-byte offset and a 16-byte filter (min/max values) per column. This leads to significant overhead, especially for datasets with sparse or wide schemas. The 128KB default size limit of `PageZero` imposes a practical maximum of approximately 6,500 columns, which is further reduced in practice by the space required for primary keys.
>
> The proposed enhancement introduces an efficient "Sparse PageZero writer" that stores metadata only for the subset of columns actually present in the current batch of documents being flushed, plus any others required for correct column assembly (e.g., in union types or nested structures). This reduces metadata overhead, enabling support for schemas with a larger number of sparse columns within the existing `PageZero` size constraint.
>
> Risks and trade-offs include a potential performance impact. The sparse format requires PageReaders to perform a binary search to look up column offsets and filters, rather than a direct index lookup, which introduces CPU overhead. There is also a minor computational overhead from the column-estimation logic.
>
> An alternative is the existing "Default" writer. The proposal includes an "Adaptive" mode that dynamically evaluates both the Default and Sparse writers for an incoming batch and selects the one that consumes the least space.
>
> A limitation of this proposal is that `PageZero` remains constrained to one page (128KB by default), so the hard limit on the number of columns in a single MegaPage stays at ~6,500 by default; this change does not remove it.
>
> This APE [https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=ASTERIXDB&title=APE+22%3A+Sparse+column+metadata+storage] would introduce a new "Sparse PageZero writer" that writes metadata for only the subset of columns present in a given batch.
>
> The source code changes are summarized as follows:
> * A new `PageZero Writer Mode` configuration option will be added with three possible values:
>    * "*Default*": Always uses the current writer.
>    * "*Sparse*": Always uses the new sparse writer.
>    * "*Adaptive*": Dynamically compares the space usage of both writers for an incoming batch and selects the one that results in a smaller `PageZero`.
> * The sparse layout will store `columnIndex`, `columnOffset`, and `columnFilter` for each present column.
> * Logic will be added to determine the minimum required set of columns for a batch, accounting for schema evolution, unions, and nested structures to ensure correct record assembly.
>
> The change is controlled by a new configuration option. Existing disk components created with the default writer will coexist with new components. Since the global metadata is maintained at the index level and used by the column assembler to reconstruct records, the system will be able to read from components created with either writer, ensuring backward compatibility.
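To make the lookup trade-off above concrete, here is a rough sketch of what resolving a column against the sparse layout could look like. Everything here (the class name, the methods, the exact slicing of `PageZero`) is my own illustration of the described `columnIndex`/`columnOffset`/`columnFilter` layout, not code from the APE:

```java
import java.nio.ByteBuffer;

/**
 * Illustrative sketch only -- names and layout details are hypothetical,
 * not AsterixDB's actual implementation.
 */
final class SparsePageZeroReader {
    // Assumed per-entry layout from the proposal: 4-byte columnIndex,
    // 4-byte columnOffset, 16-byte min/max filter = 24 bytes per present column.
    private static final int ENTRY_SIZE = 4 + 4 + 16;

    private final ByteBuffer entries; // slice of PageZero holding the sorted entries
    private final int entryCount;     // number of columns present in this MegaPage

    SparsePageZeroReader(ByteBuffer entries, int entryCount) {
        this.entries = entries;
        this.entryCount = entryCount;
    }

    /** Binary search over entries sorted by columnIndex; -1 if the column is absent. */
    int findSlot(int columnIndex) {
        int lo = 0, hi = entryCount - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int midColumn = entries.getInt(mid * ENTRY_SIZE);
            if (midColumn == columnIndex) {
                return mid;
            } else if (midColumn < columnIndex) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return -1; // column not present in this batch: assemble it as missing
    }

    int columnOffset(int slot) {
        return entries.getInt(slot * ENTRY_SIZE + 4);
    }
}
```

A reader over the default layout can jump straight to the entry at `columnIndex * entrySize`, so this O(log n) search is exactly the CPU overhead the proposal flags for performance testing.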
> The following areas will be tested to validate the change:
>
> *Performance Testing*:
> Once a prototype is available, performance testing should evaluate the trade-offs:
> 1. *Indirect Column Lookup*: Measure the CPU overhead introduced by using binary search to locate column offsets and filters.
> 2. *Column Estimation Overhead*: Measure the computational cost of the column-estimation step.
>
> *Functional Testing*:
> 1. *Default Writer Validation*: Run the existing test suite with documents containing most or all fields to ensure the default writer's behavior is unchanged.
> 2. *Sparse Writer Validation*: Design a new test suite with batches of sparse documents (where each batch contains a subset of fields) to verify that the `SparsePageZeroWriter` produces smaller disk components. Tests will be constructed with column sets at or below the ~6,500-column limit.
> 3. *Correctness Checks*: For both writers, compare query results against row-format collections to ensure correctness, paying special attention to missing fields, null values, and nested structures (arrays, objects, unions).
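Also, for anyone checking the arithmetic behind the ~6,500-column ceiling and the Adaptive mode's choice, here is a back-of-the-envelope sketch using the entry widths quoted above (again, illustrative names only, not the proposal's writer interfaces):

```java
/** Illustrative sizing math only; the real writer interfaces are omitted. */
final class PageZeroSizing {
    static final int OFFSET_BYTES = 4;   // per-column offset (both layouts)
    static final int FILTER_BYTES = 16;  // per-column min/max filter (both layouts)
    static final int INDEX_BYTES = 4;    // columnIndex, stored by the sparse layout only

    // Default writer: an entry for every column in the global schema (20 B/column).
    static long defaultSize(int globalColumns) {
        return (long) globalColumns * (OFFSET_BYTES + FILTER_BYTES);
    }

    // Sparse writer: an entry only for columns present in the batch (24 B/column).
    static long sparseSize(int presentColumns) {
        return (long) presentColumns * (INDEX_BYTES + OFFSET_BYTES + FILTER_BYTES);
    }

    // Adaptive mode: pick whichever layout yields the smaller PageZero for this batch.
    static boolean chooseSparse(int globalColumns, int presentColumns) {
        return sparseSize(presentColumns) < defaultSize(globalColumns);
    }

    public static void main(String[] args) {
        // 128 KB / 20 B = ~6,553 entries: the ~6,500-column ceiling mentioned above.
        System.out.println((128 * 1024) / (OFFSET_BYTES + FILTER_BYTES));
        // Sparse wins whenever 24 * present < 20 * global, i.e. when a batch
        // touches fewer than roughly 5/6 of the global columns.
        System.out.println(chooseSparse(6000, 1000)); // true
    }
}
```

By this estimate the sparse layout pays 4 extra bytes per entry for the explicit `columnIndex`, so the Adaptive mode should switch to it whenever a batch contains fewer than about five-sixths of the global schema's columns.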