+1, this APE is really cool and a great solution to tricky
situations like objects with generated field names.

On Thu, Jun 12, 2025 at 2:26 AM Ritik Raj <ri...@apache.org> wrote:
>
> During data ingestion or upsert operations, documents are flushed to disk
> in batches, creating disk components. In the columnar storage format, each
> MegaPage, which logically represents a leaf, begins with a single metadata
> page called `PageZero`.
>
> Currently, `PageZero` stores metadata for every column in the global
> schema, even if a column is not present in the documents of the current
> batch. This metadata includes a 4-byte offset and a 16-byte filter (min/max
> values) per column, which leads to significant overhead, especially for
> datasets with sparse or wide schemas. The 128KB default size limit of
> `PageZero` caps a MegaPage at approximately 6,500 columns, and the space
> required for primary keys reduces that ceiling further in practice.
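>
> (Back-of-the-envelope: each per-column entry is 4 + 16 = 20 bytes, and
> 128KB = 131,072 bytes, so 131,072 / 20 ≈ 6,553 columns before any space is
> reserved for primary keys.)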
>
> The proposed enhancement introduces an efficient "Sparse PageZero writer"
> that stores metadata only for the subset of columns actually present in the
> current batch of documents being flushed, plus any other columns required
> for correct column assembly (e.g., in union types or nested structures).
> This reduces metadata overhead, enabling schemas with a larger number of
> sparse columns to fit within the existing `PageZero` size constraint.
>
> Risks and trade-offs include a potential performance impact. The sparse
> format requires PageReaders to perform a binary search to look up column
> offsets and filters, rather than a direct index lookup, which introduces
> CPU overhead. There is also a minor computational overhead from the column
> estimation logic.
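>
> For intuition, here is a minimal sketch of the two lookup paths; the class
> and method names below are hypothetical, not the actual AsterixDB code:
>
> ```java
> import java.util.Arrays;
>
> // Minimal sketch of the two lookup strategies; names are hypothetical.
> final class PageZeroLookupSketch {
>
>     // Default layout: one entry per column in the global schema, so a
>     // column's offset is found by direct indexing: O(1).
>     static int defaultLookup(int[] offsetsByColumn, int columnIndex) {
>         return offsetsByColumn[columnIndex];
>     }
>
>     // Sparse layout: entries exist only for columns present in this batch
>     // and are sorted by column index, so lookup is a binary search: O(log n).
>     static int sparseLookup(int[] presentColumns, int[] offsets, int columnIndex) {
>         int pos = Arrays.binarySearch(presentColumns, columnIndex);
>         return pos >= 0 ? offsets[pos] : -1; // -1: column absent from this MegaPage
>     }
> }
> ```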
>
> An alternative is the existing "Default" writer. The proposal includes an
> "Adaptive" mode that dynamically evaluates both the Default and Sparse
> writers for an incoming batch and selects the one that consumes the least
> space.
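>
> A minimal sketch of how the adaptive choice could be computed, using the
> per-column entry widths above plus an assumed 4-byte column index for the
> sparse layout:
>
> ```java
> // Sketch of the "Adaptive" choice: size both layouts for the incoming
> // batch and keep whichever PageZero is smaller. The 4-byte column index
> // is an assumption for illustration.
> final class AdaptiveChoiceSketch {
>     static final int OFFSET = 4, FILTER = 16, INDEX = 4;
>
>     static int defaultSize(int globalColumns) {  // entry for every schema column
>         return globalColumns * (OFFSET + FILTER);
>     }
>
>     static int sparseSize(int presentColumns) {  // entry only for present columns
>         return presentColumns * (INDEX + OFFSET + FILTER);
>     }
>
>     static String choose(int globalColumns, int presentColumns) {
>         return sparseSize(presentColumns) < defaultSize(globalColumns)
>                 ? "Sparse" : "Default";
>     }
> }
> ```
>
> For example, with 5,000 schema columns of which only 200 appear in a batch,
> the sparse layout needs ~4.8KB versus ~100KB for the default layout.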
>
> A limitation of this proposal is that `PageZero` remains constrained to a
> single page (128KB by default), so the hard limit of ~6,500 columns per
> MegaPage stays in place; this change does not remove it.
>
> This APE [
> https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=ASTERIXDB&title=APE+22%3A+Sparse+column+metadata+storage
> ]
> would introduce a new "Sparse PageZero writer" that writes metadata for
> only the subset of columns present in a given batch.
>
> The source code changes are summarized as follows:
> *   A new `PageZero Writer Mode` configuration option will be added with
> three possible values:
>     *   "*Default*": Always uses the current writer.
>     *   "*Sparse*": Always uses the new sparse writer.
>     *   "*Adaptive*": Dynamically compares the space usage of both writers
> for an incoming batch and selects the one that results in a smaller
> `PageZero`.
> *   The sparse layout will store `columnIndex`, `columnOffset`, and
> `columnFilter` for each present column (a serialization sketch follows
> this list).
> *   Logic will be added to determine the minimum required set of columns
> for a batch, accounting for schema evolution, unions, and nested structures
> to ensure correct record assembly.
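>
> To make the sparse layout concrete, here is a sketch of how one entry might
> be serialized; the 4-byte index and the 8+8-byte min/max encoding of the
> 16-byte filter are illustrative assumptions, not the proposal's wire format:
>
> ```java
> import java.nio.ByteBuffer;
>
> // Sketch of one sparse PageZero entry: (columnIndex, columnOffset, columnFilter).
> final class SparseEntrySketch {
>     // Entries are appended in ascending columnIndex order so that readers
>     // can locate a column with binary search.
>     static void writeEntry(ByteBuffer pageZero, int columnIndex, int columnOffset,
>                            long filterMin, long filterMax) {
>         pageZero.putInt(columnIndex);  // which global-schema column this entry describes
>         pageZero.putInt(columnOffset); // where that column's data starts in the MegaPage
>         pageZero.putLong(filterMin);   // 16-byte filter: min ...
>         pageZero.putLong(filterMax);   // ... and max
>     }
> }
> ```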
>
> The change is controlled by a new configuration option. Existing disk
> components created with the default writer will coexist with new
> components. Since the global metadata is maintained at the index level and
> used by the column assembler to reconstruct records, the system will be
> able to read from components created with either writer, ensuring backward
> compatibility.
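>
> One way such per-component dispatch could look (purely an assumption; the
> proposal does not specify the mechanism, and the flag byte and class names
> below are invented for illustration):
>
> ```java
> // Hypothetical sketch: each disk component records which PageZero layout
> // it used, so readers dispatch per component and old and new components
> // coexist under one index.
> interface PageZeroReader { /* offset/filter lookup methods elided */ }
> final class DefaultPageZeroReader implements PageZeroReader {} // dense: direct indexing
> final class SparsePageZeroReader implements PageZeroReader {}  // sparse: binary search
>
> final class ReaderDispatchSketch {
>     static PageZeroReader readerFor(byte layoutFlag) {
>         switch (layoutFlag) {
>             case 0:  return new DefaultPageZeroReader();
>             case 1:  return new SparsePageZeroReader();
>             default: throw new IllegalStateException("unknown PageZero layout: " + layoutFlag);
>         }
>     }
> }
> ```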
>
> The following areas will be tested to validate the change:
>
> *Performance Testing*:
> Once a prototype is available, performance testing should be done to
> evaluate the trade-offs:
> 1.  *Indirect Column Lookup*: Measure the CPU overhead introduced by using
> binary search to locate column offsets and filters (a rough timing sketch
> follows this list).
> 2.  *Column Estimation Overhead*: Measure the computational cost of the
> column estimation step.
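>
> A rough, self-contained timing sketch for item 1; a real evaluation would
> use a proper harness such as JMH and the actual PageZero readers:
>
> ```java
> import java.util.Arrays;
> import java.util.Random;
>
> // Compares direct array indexing (default layout) against binary search
> // over sorted column indexes (sparse layout). Illustrative only.
> public final class LookupBench {
>     public static void main(String[] args) {
>         final int columns = 6_000, lookups = 10_000_000;
>         int[] direct = new int[columns];   // dense: offset slot per global column
>         int[] present = new int[columns];  // sparse: sorted present-column indexes
>         for (int i = 0; i < columns; i++) { direct[i] = i; present[i] = i; }
>
>         Random rnd = new Random(42);
>         long sink = 0, t0 = System.nanoTime();
>         for (int i = 0; i < lookups; i++) sink += direct[rnd.nextInt(columns)];
>         long t1 = System.nanoTime();
>         for (int i = 0; i < lookups; i++) sink += Arrays.binarySearch(present, rnd.nextInt(columns));
>         long t2 = System.nanoTime();
>
>         System.out.printf("direct: %d ms, binary search: %d ms (ignore sink=%d)%n",
>                 (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sink);
>     }
> }
> ```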
>
> *Functional Testing*:
> 1.  *Default Writer Validation*: Run the existing test suite with documents
> containing most or all fields to ensure the default writer's behavior is
> unchanged.
> 2.  *Sparse Writer Validation*: Design a new test suite with batches of
> sparse documents (where each batch contains a subset of fields) to verify
> that the `SparsePageZeroWriter` produces smaller disk components. Tests
> will be constructed with column counts at or below the ~6,500-column limit.
> 3.  *Correctness Checks*: For both writers, compare query results with row
> format collections to ensure correctness, paying special attention to
> missing fields, null values, and nested structures (arrays, objects,
> unions).
