+1

On Fri, Jun 13, 2025 at 3:31 AM Mike Carey <dtab...@gmail.com> wrote:
> +1
>
> This APE is super important since JSON lets users do "stupid" things
> like have one of the pieces of information in the objects of a
> collection have a monotonically increasing name, e.g., using a timestamp
> as a key and then an observation (e.g., temperature) as the associated
> value. :-) Such a collection will have a never-ending, increasing set
> of "columns" (name-wise) that are each used just once. Ouch!
>
> On 6/12/25 8:06 AM, Ian Maxon wrote:
> > +1, this APE is really cool and is a great solution to tricky
> > situations like objects with generated field names.
> >
> > On Thu, Jun 12, 2025 at 2:26 AM Ritik Raj <ri...@apache.org> wrote:
> >> During data ingestion or upsert operations, documents are flushed to
> >> disk in batches, creating disk components. In the columnar storage
> >> format, each MegaPage, which logically represents a leaf, begins with a
> >> single page metadata section called `PageZero`.
> >>
> >> Currently, `PageZero` stores metadata for every column in the global
> >> schema, even if a column is not present in the documents of the current
> >> batch. This metadata includes a 4-byte offset and a 16-byte filter
> >> (min/max values) per column. This approach leads to significant
> >> overhead, especially for datasets with sparse or wide schemas. The
> >> 128KB default size limit of `PageZero` imposes a practical maximum of
> >> approximately 6,500 columns, which is further reduced in practice by
> >> the space required for primary keys.
> >>
> >> The proposed enhancement introduces an efficient "Sparse PageZero
> >> writer". This writer stores metadata only for the subset of columns
> >> that are actually present in the current batch of documents being
> >> flushed, plus any others required for correct column assembly (e.g., in
> >> union types or nested structures).
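[Editorial aside: the dense-vs-sparse metadata trade-off quoted above can be made concrete with a back-of-the-envelope sketch. The class and method names below are illustrative, not AsterixDB's actual writer API; the per-entry sizes come from the proposal (4-byte offset, 16-byte filter), and the 4-byte column index in the sparse layout is an assumption.]

```java
// Hypothetical sketch of PageZero metadata sizes under the two layouts
// described in the proposal. Not AsterixDB source code.
public class PageZeroSizeSketch {
    // Dense layout: a 4-byte offset + 16-byte filter for EVERY column in
    // the global schema, whether or not it appears in the current batch.
    static long denseMetadataBytes(int globalColumns) {
        return (long) globalColumns * (4 + 16);
    }

    // Sparse layout: per PRESENT column only, an (assumed 4-byte) column
    // index in addition to the 4-byte offset and 16-byte filter.
    static long sparseMetadataBytes(int presentColumns) {
        return (long) presentColumns * (4 + 4 + 16);
    }

    public static void main(String[] args) {
        int globalColumns = 6500;  // near the 128KB PageZero practical limit
        int presentColumns = 200;  // a sparse batch touches few columns
        System.out.println("dense:  " + denseMetadataBytes(globalColumns));
        System.out.println("sparse: " + sparseMetadataBytes(presentColumns));
    }
}
```

Under these assumptions, 6,500 global columns cost 130,000 bytes of dense metadata (nearly the whole 128KB page), while a batch touching 200 columns needs only 4,800 bytes in the sparse layout.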
> >> This reduces metadata overhead, enabling support for schemas with a
> >> larger number of sparse columns within the existing `PageZero` size
> >> constraint.
> >>
> >> Risks and trade-offs include a potential performance impact. The sparse
> >> format requires PageReaders to perform a binary search to look up
> >> column offsets and filters, rather than a direct index lookup, which
> >> introduces CPU overhead. There is also a minor computational overhead
> >> from the column estimation logic.
> >>
> >> An alternative is the existing "Default" writer. The proposal includes
> >> an "Adaptive" mode that dynamically evaluates both the Default and
> >> Sparse writers for an incoming batch and selects the one that consumes
> >> the least space.
> >>
> >> A limitation of this proposal is that the `PageZero` size remains
> >> constrained to one page (128KB by default), so the hard limit on the
> >> number of columns in a single MegaPage remains ~6,500 by default. This
> >> limit is not removed by this change.
> >>
> >> This APE[
> >> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=ASTERIXDB&title=APE*22*3A*Sparse*column*metadata*storage__;KyUrKysr!!CzAuKJ42GuquVTTmVmPViYEvSg!LYuLE2dTTaSmQqZFBtmfGfDeQ6wK34-3HPl68Ji6x8BUouvD48jjKQcklfT4D9zoHj615BA1DUAR$ ]
> >> would introduce a new "Sparse PageZero writer" that writes metadata for
> >> only the subset of columns present in a given batch.
> >>
> >> The source code changes are summarized as follows:
> >> * A new `PageZero Writer Mode` configuration option will be added with
> >>   three possible values:
> >>   * "*Default*": Always uses the current writer.
> >>   * "*Sparse*": Always uses the new sparse writer.
> >>   * "*Adaptive*": Dynamically compares the space usage of both writers
> >>     for an incoming batch and selects the one that results in a smaller
> >>     `PageZero`.
> >> * The sparse layout will store `columnIndex`, `columnOffset`, and
> >>   `columnFilter` for each present column.
> >> * Logic will be added to determine the minimum required set of columns
> >>   for a batch, accounting for schema evolution, unions, and nested
> >>   structures to ensure correct record assembly.
> >>
> >> The change is controlled by a new configuration option. Existing disk
> >> components created with the default writer will coexist with new
> >> components. Since the global metadata is maintained at the index level
> >> and used by the column assembler to reconstruct records, the system
> >> will be able to read from components created with either writer,
> >> ensuring backward compatibility.
> >>
> >> The following areas will be tested to validate the change:
> >>
> >> *Performance Testing*:
> >> Once a prototype is available, performance testing should be done to
> >> evaluate the trade-offs:
> >> 1. *Indirect Column Lookup*: Measure the CPU overhead introduced by
> >>    using binary search to locate column offsets and filters.
> >> 2. *Column Estimation Overhead*: Measure the computational cost of the
> >>    column estimation step.
> >>
> >> *Functional Testing*:
> >> 1. *Default Writer Validation*: Run the existing test suite with
> >>    documents containing most or all fields to ensure the default
> >>    writer's behavior is unchanged.
> >> 2. *Sparse Writer Validation*: Design a new test suite with batches of
> >>    sparse documents (where each batch contains a subset of fields) to
> >>    verify that the `SparsePageZeroWriter` produces smaller disk
> >>    components. Tests will be constructed with a column set less than or
> >>    equal to the 6,500-column limit.
> >> 3. *Correctness Checks*: For both writers, compare query results with
> >>    row-format collections to ensure correctness, paying special
> >>    attention to missing fields, null values, and nested structures
> >>    (arrays, objects, unions).

--
*Regards,*
Wail Alkowaileet
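[Editorial aside: the "indirect column lookup" cost the proposal plans to measure comes from replacing a direct array index with a binary search over the sorted column indexes present in a batch. A minimal sketch, with illustrative names rather than AsterixDB's actual reader classes:]

```java
import java.util.Arrays;

// Hypothetical sketch of the sparse PageZero lookup: the reader binary-
// searches the sorted list of present column indexes to find the slot
// holding that column's offset and filter. Not AsterixDB source code.
public class SparseLookupSketch {
    // Sorted column indexes present in this (imaginary) batch's PageZero.
    static final int[] PRESENT = {3, 17, 42, 128, 4096};

    // Returns the metadata slot for `columnIndex`, or -1 if the column is
    // absent from this batch (the reader would then treat it as missing).
    static int findSlot(int columnIndex) {
        int pos = Arrays.binarySearch(PRESENT, columnIndex);
        return pos >= 0 ? pos : -1;
    }

    public static void main(String[] args) {
        System.out.println(findSlot(42));  // present: prints its slot, 2
        System.out.println(findSlot(7));   // absent in this batch: prints -1
    }
}
```

The dense layout answers the same question in O(1) by indexing slot `columnIndex` directly; the sparse layout pays O(log n) per lookup in exchange for storing only n present columns, which is exactly the CPU-versus-space trade-off the proposal's performance tests target.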