+1

On Thu, Jun 26, 2025 at 3:08 PM Mike Carey <[email protected]> wrote:
>
> This looks excellent!
>
> +1 for adopting this extension to our storage system ASAP.
>
> On 6/26/25 11:53 AM, Ritik Raj wrote:
> > *Expanding PageZero to Support Unlimited Columns*
> > APE:
> > https://cwiki.apache.org/confluence/display/ASTERIXDB/APE+23%3A+Unlimited+Columns+Support
> >
> > In the columnar storage format, each MegaPage represents a logical leaf
> > node and begins with `PageZero`, a metadata section that captures essential
> > column metadata including column offsets and min/max filters. Originally,
> > `PageZero` was constrained to reside in a single page (typically 128 KB),
> > with a fixed layout that stored information for **every column** in the
> > global schema.
> >
> > Each column entry consumed 4 bytes for its offset and 16 bytes for its
> > min/max filter, giving a **metadata footprint of 20 bytes per column**.
> > A 128 KB page holds at most 6,553 such entries (131,072 / 20), so after
> > reserving part of `PageZero` for primary key metadata and structural
> > headers, the **maximum number of columns supported was capped at ~6,000**.
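
A quick sketch of the arithmetic behind that cap, for anyone who wants to play with the numbers. The 4-byte offset, 16-byte filter, and 128 KB page size come from the proposal; the function name and the 8 KB reserve in the second call are hypothetical, since the actual size of the primary key metadata and headers is not stated here.

```python
COLUMN_OFFSET_BYTES = 4    # per the proposal
MIN_MAX_FILTER_BYTES = 16  # per the proposal
ENTRY_BYTES = COLUMN_OFFSET_BYTES + MIN_MAX_FILTER_BYTES  # 20 bytes per column


def max_columns(page_size_bytes: int, reserved_bytes: int = 0) -> int:
    """Upper bound on column entries that fit in a single-page PageZero."""
    usable = page_size_bytes - reserved_bytes
    if usable < 0:
        raise ValueError("reserved_bytes exceeds the page size")
    return usable // ENTRY_BYTES


PAGE_SIZE = 128 * 1024  # typical page size mentioned in the proposal

print(max_columns(PAGE_SIZE))                            # 6553, nothing reserved
print(max_columns(PAGE_SIZE, reserved_bytes=8 * 1024))   # 6144, hypothetical 8 KB reserve
```

Any realistic reserve lands the limit in the low six thousands, which matches the ~6,000 figure quoted above.
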
> >
> > This limitation became problematic for datasets with **wide or sparse
> > schemas**, where many columns may be missing in individual document batches
> > but still occupy space in `PageZero`. The presence of unused metadata
> > bloated the footprint and limited scalability.
> >
> > *Multi-Segment PageZero: Motivation and Layout*
> >
> > To overcome this limitation, we introduce **multi-segment support in
> > PageZero**. Instead of storing all metadata in a single fixed block, we
> > partition PageZero into multiple **segments**, with the **first (zeroth)
> > segment storing primary key metadata and as many column entries as it can
> > fit**, and subsequent segments storing the remaining metadata.
> >
> > Each segment follows the same layout: column index → offset → min → max,
> > stored in an interleaved manner. This structure keeps scans and lookups
> > efficient while letting the column count grow **well beyond the
> > single-page limit**, bounded only by the MegaPage size.
> >
> > *Segment Layout:*
> >
> > ```
> > [ Segment Header ]
> >   ├─ Number of Columns
> >   ├─ Max Column Index in Segment
> > [ Interleaved Metadata Entries ]
> >   ├─ ColumnIndex₁, Offset₁, Min₁, Max₁
> >   ├─ ColumnIndex₂, Offset₂, Min₂, Max₂
> >   └─ ...
> > ```
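
To make the layout above concrete, here is a minimal encode/decode sketch. It is not the actual on-disk format: the proposal gives 4 bytes for the offset and 16 bytes for the min/max filter, but the 4-byte header fields, the 4-byte column index, the 8/8 split of the filter, and the big-endian byte order are all my assumptions, as are the function names.

```python
import struct

# Assumed widths (see the note above): two 4-byte header fields, then
# 24-byte entries of column index (4), offset (4), min (8), max (8).
HEADER = struct.Struct(">ii")   # number of columns, max column index in segment
ENTRY = struct.Struct(">iiqq")  # column index, offset, min, max


def encode_segment(entries):
    """entries: list of (column_index, offset, min_value, max_value) tuples."""
    max_index = max((e[0] for e in entries), default=-1)
    buf = bytearray(HEADER.pack(len(entries), max_index))
    for entry in entries:
        buf += ENTRY.pack(*entry)  # fields stay interleaved per column
    return bytes(buf)


def decode_segment(buf):
    """Return (max_column_index, entries) read back from an encoded segment."""
    count, max_index = HEADER.unpack_from(buf, 0)
    entries = [
        ENTRY.unpack_from(buf, HEADER.size + i * ENTRY.size)
        for i in range(count)
    ]
    return max_index, entries


segment = encode_segment([(3, 100, -5, 42), (17, 260, 0, 9)])
print(len(segment))             # 8-byte header + 2 * 24-byte entries
print(decode_segment(segment))
```

Storing the max column index in the header means a reader can skip a whole segment when the column it wants has a larger index, assuming entries are kept in index order.
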
> >
> > A new `DefaultColumnMultiPageZeroWriter` class was introduced to manage
> > this segmented layout. It delegates metadata writing to individual segments
> > while maintaining headers at the top level for navigation.
> >
> > *Adaptive Writer Selection*
> >
> > To avoid burdening all batches with this segmented structure, we retain the
> > `DefaultColumnPageZeroWriter` for small or dense schemas. A new **adaptive
> > selection mechanism** compares the space usage of both writers for a batch
> > and picks the one that uses less space.
> >
> > The decision logic weighs:
> > - Space taken by the default multi-segment writer (fixed layout for all
> > columns)
> > - Space taken by the sparse multi-segment writer (compact layout for
> > present columns only)
> >
> > This logic is encapsulated in `PageZeroWriterFlavorSelector`.
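
A rough sketch of the comparison described above, as I understand it. This is not the `PageZeroWriterFlavorSelector` code: the function names are hypothetical, segment headers are ignored, and the 24-byte sparse entry assumes the 20-byte entry from the proposal plus a 4-byte column index.

```python
DEFAULT_ENTRY_BYTES = 20  # offset (4) + min/max filter (16), per the proposal
SPARSE_ENTRY_BYTES = 24   # assumed: the same entry plus a 4-byte column index


def default_writer_bytes(total_columns: int) -> int:
    """Fixed layout: one entry for every column in the global schema."""
    return total_columns * DEFAULT_ENTRY_BYTES


def sparse_writer_bytes(present_columns: int) -> int:
    """Compact layout: entries only for columns present in this batch."""
    return present_columns * SPARSE_ENTRY_BYTES


def choose_writer(total_columns: int, present_columns: int) -> str:
    """Pick the layout with the smaller metadata footprint; ties go to default."""
    default_cost = default_writer_bytes(total_columns)
    sparse_cost = sparse_writer_bytes(present_columns)
    return "sparse" if sparse_cost < default_cost else "default"


print(choose_writer(total_columns=10_000, present_columns=500))  # sparse batch
print(choose_writer(total_columns=100, present_columns=100))     # dense batch
```

Under these assumed widths the sparse layout only wins once fewer than five sixths of the schema's columns are present in the batch, which is why dense batches stay on the default writer.
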
> >
> > *New Configuration Options:*
> >
> > Two new storage configuration parameters have been introduced:
> >
> > 1. **`STORAGE_MAX_COLUMNS_IN_ZEROTH_SEGMENT`** (`INTEGER_BYTE_UNIT`,
> > default: `5000`)
> >     Controls the maximum number of columns that can be stored in the zeroth
> > segment of `PageZero`. Remaining columns, if any, are offloaded to
> > additional segments. This balances lookup performance (fastest in the
> > zeroth segment) against scalability. The default may change based on
> > performance experiments.
> >
> > 2. **`STORAGE_PAGE_ZERO_WRITER`** (`STRING`, default: `"default"`)
> >     Controls the writer strategy used during flush. Accepted values are:
> >     - `"default"`: Always use the legacy writer.
> >     - `"sparse"`: Always use the sparse writer (only present columns).
> >     - `"adaptive"`: Dynamically compare both and pick the writer that uses
> > less space.
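
To illustrate how the zeroth-segment limit might translate into a segment plan, here is a small sketch. The default of 5000 mirrors `STORAGE_MAX_COLUMNS_IN_ZEROTH_SEGMENT`; the function name and the `max_per_extra_segment` parameter are hypothetical, since the proposal does not say how many columns each additional segment holds.

```python
def plan_segments(num_columns, max_in_zeroth=5000, max_per_extra_segment=5000):
    """Return the number of column entries per segment, zeroth segment first."""
    if num_columns < 0:
        raise ValueError("num_columns must be non-negative")
    if max_in_zeroth <= 0 or max_per_extra_segment <= 0:
        raise ValueError("segment capacities must be positive")

    # The zeroth segment always exists, since it also carries primary key metadata.
    sizes = [min(num_columns, max_in_zeroth)]
    remaining = num_columns - sizes[0]

    # Overflow columns are offloaded to additional segments until none remain.
    while remaining > 0:
        take = min(remaining, max_per_extra_segment)
        sizes.append(take)
        remaining -= take
    return sizes


print(plan_segments(3_000))   # [3000]
print(plan_segments(12_000))  # [5000, 5000, 2000]
```

A schema that fits under the limit never pays for extra segments, which lines up with the point about lookups being fastest in the zeroth segment.
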
> >
> > *Summary of Changes*
> >
> > - Interleaved layout per segment for columnIndex, offset, min, max.
> > - Logic to estimate the number of segments and assign columns to segments.
> > - Writer is selected dynamically using `PageZeroWriterFlavorSelector`.
> >
> > *Benefits*
> >
> > - Unlocks support for **tens of thousands of columns** per MegaPage.
> > - Better space efficiency for sparse batches.
> > - Retains backward compatibility: previously ingested MegaLeafs can still
> > be read.
> >
> > This change is essential for evolving workloads that increasingly rely on
> > flexible schemas and sparse data layouts.
> >
