+1

On Fri, Jun 13, 2025 at 3:31 AM Mike Carey <dtab...@gmail.com> wrote:
> +1
>
> This APE is super important since JSON lets users do "stupid" things
> like have one of the pieces of information in the objects of a
> collection have a monotonically increasing name, e.g., using a timestamp
> as a key and then an observation (e.g., temperature) as the associated
> value. :-) Such a collection will have a never-ending, increasing set
> of "columns" (name-wise) that are each used just once. Ouch!
>
> On 6/12/25 8:06 AM, Ian Maxon wrote:
> > +1, this APE is really cool and is a great solution to tricky
> > situations like objects with generated field names.
> >
> > On Thu, Jun 12, 2025 at 2:26 AM Ritik Raj <ri...@apache.org> wrote:
> >> During data ingestion or upsert operations, documents are flushed to
> >> disk in batches, creating disk components. In the columnar storage
> >> format, each MegaPage, which logically represents a leaf, begins with a
> >> single page metadata section called `PageZero`.
> >>
> >> Currently, `PageZero` stores metadata for every column in the global
> >> schema, even if a column is not present in the documents of the current
> >> batch. This metadata includes a 4-byte offset and a 16-byte filter
> >> (min/max values) per column. This approach leads to significant
> >> overhead, especially for datasets with sparse or wide schemas. The
> >> 128KB default size limit of `PageZero` imposes a practical maximum of
> >> approximately 6,500 columns, which is further reduced in practice by
> >> the space required for primary keys.
> >>
> >> The proposed enhancement introduces an efficient "Sparse PageZero
> >> writer". This writer stores metadata only for the subset of columns
> >> that are actually present in the current batch of documents being
> >> flushed, plus any others required for correct column assembly (e.g., in
> >> union types or nested structures).
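[Editorial aside: the dense-vs-sparse metadata trade-off quoted above can be made concrete with a back-of-the-envelope sketch. The class and method names below are illustrative, not AsterixDB's actual writer API; the per-entry sizes come from the proposal (4-byte offset, 16-byte filter), and the 4-byte column index in the sparse layout is an assumption.]

```java
// Hypothetical sketch of PageZero metadata sizes under the two layouts
// described in the proposal. Not AsterixDB source code.
public class PageZeroSizeSketch {
    // Dense layout: a 4-byte offset + 16-byte filter for EVERY column in
    // the global schema, whether or not it appears in the current batch.
    static long denseMetadataBytes(int globalColumns) {
        return (long) globalColumns * (4 + 16);
    }

    // Sparse layout: per PRESENT column only, an (assumed 4-byte) column
    // index in addition to the 4-byte offset and 16-byte filter.
    static long sparseMetadataBytes(int presentColumns) {
        return (long) presentColumns * (4 + 4 + 16);
    }

    public static void main(String[] args) {
        int globalColumns = 6500;  // near the 128KB PageZero practical limit
        int presentColumns = 200;  // a sparse batch touches few columns
        System.out.println("dense:  " + denseMetadataBytes(globalColumns));
        System.out.println("sparse: " + sparseMetadataBytes(presentColumns));
    }
}
```

Under these assumptions, 6,500 global columns cost 130,000 bytes of dense metadata (nearly the whole 128KB page), while a batch touching 200 columns needs only 4,800 bytes in the sparse layout.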
> >> This reduces metadata overhead, enabling support for schemas with a
> >> larger number of sparse columns within the existing `PageZero` size
> >> constraint.
> >>
> >> Risks and trade-offs include a potential performance impact. The sparse
> >> format requires PageReaders to perform a binary search to look up
> >> column offsets and filters, rather than a direct index lookup, which
> >> introduces CPU overhead. There is also a minor computational overhead
> >> from the column estimation logic.
> >>
> >> An alternative is the existing "Default" writer. The proposal includes
> >> an "Adaptive" mode that dynamically evaluates both the Default and
> >> Sparse writers for an incoming batch and selects the one that consumes
> >> the least space.
> >>
> >> A limitation of this proposal is that the `PageZero` size remains
> >> constrained to one page (128KB by default), so the hard limit on the
> >> number of columns in a single MegaPage remains ~6,500 by default. This
> >> limit is not removed by this change.
> >>
> >> This APE[
> >> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=ASTERIXDB&title=APE*22*3A*Sparse*column*metadata*storage__;KyUrKysr!!CzAuKJ42GuquVTTmVmPViYEvSg!LYuLE2dTTaSmQqZFBtmfGfDeQ6wK34-3HPl68Ji6x8BUouvD48jjKQcklfT4D9zoHj615BA1DUAR$ ]
> >> would introduce a new "Sparse PageZero writer" that writes metadata for
> >> only the subset of columns present in a given batch.
> >>
> >> The source code changes are summarized as follows:
> >> * A new `PageZero Writer Mode` configuration option will be added with
> >>   three possible values:
> >>   * "*Default*": Always uses the current writer.
> >>   * "*Sparse*": Always uses the new sparse writer.
> >>   * "*Adaptive*": Dynamically compares the space usage of both writers
> >>     for an incoming batch and selects the one that results in a smaller
> >>     `PageZero`.
> >> * The sparse layout will store `columnIndex`, `columnOffset`, and
> >>   `columnFilter` for each present column.
> >> * Logic will be added to determine the minimum required set of columns
> >>   for a batch, accounting for schema evolution, unions, and nested
> >>   structures to ensure correct record assembly.
> >>
> >> The change is controlled by a new configuration option. Existing disk
> >> components created with the default writer will coexist with new
> >> components. Since the global metadata is maintained at the index level
> >> and used by the column assembler to reconstruct records, the system
> >> will be able to read from components created with either writer,
> >> ensuring backward compatibility.
> >>
> >> The following areas will be tested to validate the change:
> >>
> >> *Performance Testing*:
> >> Once a prototype is available, performance testing should be done to
> >> evaluate the trade-offs:
> >> 1. *Indirect Column Lookup*: Measure the CPU overhead introduced by
> >>    using binary search to locate column offsets and filters.
> >> 2. *Column Estimation Overhead*: Measure the computational cost of the
> >>    column estimation step.
> >>
> >> *Functional Testing*:
> >> 1. *Default Writer Validation*: Run the existing test suite with
> >>    documents containing most or all fields to ensure the default
> >>    writer's behavior is unchanged.
> >> 2. *Sparse Writer Validation*: Design a new test suite with batches of
> >>    sparse documents (where each batch contains a subset of fields) to
> >>    verify that the `SparsePageZeroWriter` produces smaller disk
> >>    components. Tests will be constructed with a column set less than or
> >>    equal to the 6,500-column limit.
> >> 3. *Correctness Checks*: For both writers, compare query results with
> >>    row-format collections to ensure correctness, paying special
> >>    attention to missing fields, null values, and nested structures
> >>    (arrays, objects, unions).

--
*Regards,*
Wail Alkowaileet
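[Editorial aside: the "indirect column lookup" cost the proposal plans to measure comes from replacing a direct array index with a binary search over the sorted column indexes present in a batch. A minimal sketch, with illustrative names rather than AsterixDB's actual reader classes:]

```java
import java.util.Arrays;

// Hypothetical sketch of the sparse PageZero lookup: the reader binary-
// searches the sorted list of present column indexes to find the slot
// holding that column's offset and filter. Not AsterixDB source code.
public class SparseLookupSketch {
    // Sorted column indexes present in this (imaginary) batch's PageZero.
    static final int[] PRESENT = {3, 17, 42, 128, 4096};

    // Returns the metadata slot for `columnIndex`, or -1 if the column is
    // absent from this batch (the reader would then treat it as missing).
    static int findSlot(int columnIndex) {
        int pos = Arrays.binarySearch(PRESENT, columnIndex);
        return pos >= 0 ? pos : -1;
    }

    public static void main(String[] args) {
        System.out.println(findSlot(42));  // present: prints its slot, 2
        System.out.println(findSlot(7));   // absent in this batch: prints -1
    }
}
```

The dense layout answers the same question in O(1) by indexing slot `columnIndex` directly; the sparse layout pays O(log n) per lookup in exchange for storing only n present columns, which is exactly the CPU-versus-space trade-off the proposal's performance tests target.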