+1
On Thu, Jun 26, 2025 at 3:08 PM Mike Carey <[email protected]> wrote:
>
> This looks excellent!
>
> +1 for adopting this extension to our storage system ASAP.
>
> On 6/26/25 11:53 AM, Ritik Raj wrote:
> > *Expanding PageZero to Support Unlimited Columns*
> > APE:
> > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/ASTERIXDB/APE*23*3A*Unlimited*Columns*Support__;KyUrKys!!CzAuKJ42GuquVTTmVmPViYEvSg!P1EHZSwq7hcpOlyHuy7R1F0lAkJK31elLGusrjb58xBVxuuNH4gxpVwKRuJSv9mByOtN5siVn5A6sQ$
> >
> > In the columnar storage format, each MegaPage represents a logical leaf
> > node and begins with `PageZero`, a metadata section that captures essential
> > column metadata including column offsets and min/max filters. Originally,
> > `PageZero` was constrained to reside in a single page (typically 128KB),
> > with a fixed layout that stored information for **every column** in the
> > global schema.
> >
> > Each column entry consumed 4 bytes for offset and 16 bytes for a min/max
> > filter, leading to a **metadata footprint of 20 bytes per column**. With
> > this layout, the **maximum number of columns supported was capped at
> > ~6,000**, given space constraints and the need to reserve part of
> > `PageZero` for primary key metadata and structural headers.
> >
> > This limitation became problematic for datasets with **wide or sparse
> > schemas**, where many columns may be missing in individual document batches
> > but still occupy space in `PageZero`. The presence of unused metadata
> > bloated the footprint and limited scalability.
> >
> > *Multi-Segment PageZero: Motivation and Layout*
> >
> > To overcome this limitation, we introduce **multi-segment support in
> > PageZero**. Instead of storing all metadata in a single fixed block, we
> > partition PageZero into multiple **segments**, with the **first (zeroth)
> > segment storing primary key metadata and as many column entries as it can
> > fit**, and subsequent segments storing the remaining metadata.
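
A quick sanity check on the numbers above, as a minimal Python sketch. The 4-byte offset, 16-byte filter, and 128KB page come from the proposal; `reserved_bytes`, `max_columns`, `split_into_segments`, and the capacities used in the example are hypothetical names and values for illustration, not anything taken from the patch.

```python
PAGE_SIZE = 128 * 1024      # "typically 128KB" per the proposal
OFFSET_BYTES = 4            # per-column offset
FILTER_BYTES = 16           # per-column min/max filter
ENTRY_BYTES = OFFSET_BYTES + FILTER_BYTES  # 20 bytes per column


def max_columns(page_size, reserved_bytes=0):
    """Columns that fit in one page after reserving space (hypothetical helper)."""
    return (page_size - reserved_bytes) // ENTRY_BYTES


def split_into_segments(num_columns, zeroth_capacity, later_capacity):
    """Return per-segment column counts; the zeroth segment always exists.

    Both capacities are assumptions supplied by the caller.
    """
    if num_columns < 0 or zeroth_capacity < 0 or later_capacity <= 0:
        raise ValueError("capacities must be positive and counts non-negative")
    counts = [min(num_columns, zeroth_capacity)]
    remaining = num_columns - counts[0]
    while remaining > 0:
        take = min(remaining, later_capacity)
        counts.append(take)
        remaining -= take
    return counts


print(max_columns(PAGE_SIZE))                   # 6553
print(split_into_segments(12000, 5000, 6553))   # [5000, 6553, 447]
```

With nothing reserved the raw ceiling is 6,553 entries, so the quoted cap of ~6,000 is consistent with roughly 10 KB of the page being set aside for primary key metadata and headers.
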
> >
> > Each segment follows the same layout: column index → offset → min → max,
> > stored in an interleaved manner. This structure ensures efficient scan and
> > lookup, while enabling us to scale to **arbitrarily many columns**, bounded
> > only by MegaPage size.
> >
> > *Segment Layout:*
> >
> > ```
> > [ Segment Header ]
> > ├─ Number of Columns
> > ├─ Max Column Index in Segment
> > [ Interleaved Metadata Entries ]
> > ├─ ColumnIndex₁, Offset₁, Min₁, Max₁
> > ├─ ColumnIndex₂, Offset₂, Min₂, Max₂
> > └─ ...
> > ```
> >
> > A new `DefaultColumnMultiPageZeroWriter` class was introduced to manage
> > this segmented layout. It delegates metadata writing to individual segments
> > while maintaining headers at the top-level for navigation.
> >
> > *Adaptive Writer Selection*
> >
> > To avoid burdening all batches with this segmented structure, we retain the
> > `DefaultColumnPageZeroWriter` for small or dense schemas. A new **adaptive
> > selection mechanism** compares space usage of both writers for a batch and
> > picks the optimal one.
> >
> > The decision logic weighs:
> > - Space taken by Default Multi-segment writer (fixed layout for all columns)
> > - Space taken by Sparse Multi-Segment writer (compact layout for present
> > columns)
> >
> > This logic is encapsulated in `PageZeroWriterFlavorSelector`.
> >
> > *New Configuration Options:*
> >
> > Two new storage configuration parameters have been introduced:
> >
> > 1. **`STORAGE_MAX_COLUMNS_IN_ZEROTH_SEGMENT`** (`INTEGER_BYTE_UNIT`,
> > default: `5000`)
> > Controls the maximum number of columns that can be stored in the zeroth
> > segment of `PageZero`. Remaining columns, if any, are offloaded to
> > additional segments. This helps balance lookup performance (fast for zeroth
> > segment) and scalability. This might change based on perf experiments.
> >
> > 2. **`STORAGE_PAGE_ZERO_WRITER`** (`STRING`, default: `"default"`)
> > Controls the writer strategy used during flush.
> > Accepted values are:
> > - `"default"`: Always use the legacy writer.
> > - `"sparse"`: Always use the sparse writer (only present columns).
> > - `"adaptive"`: Dynamically compare both and pick the writer that uses
> > less space.
> >
> > *Summary of Changes*
> >
> > - Interleaved layout per segment for columnIndex, offset, min, max.
> > - Logic to estimate the number of segments and assign columns to segments.
> > - Writer is selected dynamically using `PageZeroWriterFlavorSelector`.
> >
> > *Benefits*
> >
> > - Unlocks support for **tens of thousands of columns** per MegaPage.
> > - Better space efficiency for sparse batches.
> > - Retains backward compatibility: Already ingested MegaLeafs can also be
> > read.
> >
> > This change is essential for evolving workloads that increasingly rely on
> > flexible schemas and sparse data layouts.
> >
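
On the three `STORAGE_PAGE_ZERO_WRITER` values: a rough Python sketch of how the size comparison might look. It is only an illustration under stated assumptions, not the logic of `PageZeroWriterFlavorSelector`. The 20-byte fixed entry comes from the proposal; the 4-byte column index in the sparse entry and the function name `choose_writer` are guesses made for this example, and segment headers are ignored.

```python
FIXED_ENTRY_BYTES = 4 + 16           # offset + min/max filter (from the proposal)
COLUMN_INDEX_BYTES = 4               # ASSUMPTION: width of the stored column index
SPARSE_ENTRY_BYTES = COLUMN_INDEX_BYTES + FIXED_ENTRY_BYTES


def choose_writer(mode, total_columns, present_columns):
    """Pick "default" or "sparse" for one batch (hypothetical helper).

    mode mirrors the three accepted configuration values.
    """
    if mode in ("default", "sparse"):
        return mode                  # forced by configuration
    if mode != "adaptive":
        raise ValueError(f"unknown writer mode: {mode!r}")
    default_size = total_columns * FIXED_ENTRY_BYTES     # entry for every column
    sparse_size = present_columns * SPARSE_ENTRY_BYTES   # entries for present columns only
    return "sparse" if sparse_size < default_size else "default"


print(choose_writer("adaptive", 10_000, 500))   # sparse
print(choose_writer("adaptive", 100, 100))      # default
```

Under these assumed widths the sparse layout wins whenever fewer than 5/6 of the schema's columns are present in the batch; the real break-even point depends on the actual entry and header sizes.
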

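To make the quoted segment layout concrete, here is a small Python sketch that packs and searches one segment. The field order (column index, offset, min, max) and the two header fields follow the diagram; the byte widths, the big-endian encoding, the 8-byte min and max values, and the binary search are all assumptions for illustration, and `pack_segment` and `find_column` are hypothetical names.

```python
import struct

# ASSUMED encodings; the proposal only states a 4-byte offset and a 16-byte filter.
HEADER = struct.Struct(">ii")    # number of columns, max column index in segment
ENTRY = struct.Struct(">iiqq")   # column index, offset, min, max (interleaved)


def pack_segment(entries):
    """Serialize (column_index, offset, min, max) tuples, sorted by column index."""
    entries = sorted(entries)
    max_index = entries[-1][0] if entries else -1
    buf = bytearray(HEADER.pack(len(entries), max_index))
    for entry in entries:
        buf += ENTRY.pack(*entry)
    return bytes(buf)


def find_column(segment, column_index):
    """Return (offset, min, max) for a column, or None if it is not in the segment."""
    count, max_index = HEADER.unpack_from(segment, 0)
    if column_index > max_index:
        return None              # header check lets a reader skip the whole segment
    lo, hi = 0, count - 1
    while lo <= hi:              # binary search over fixed-width entries
        mid = (lo + hi) // 2
        idx, offset, lo_val, hi_val = ENTRY.unpack_from(
            segment, HEADER.size + mid * ENTRY.size)
        if idx == column_index:
            return offset, lo_val, hi_val
        if idx < column_index:
            lo = mid + 1
        else:
            hi = mid - 1
    return None


segment = pack_segment([(7, 4096, -5, 90), (2, 128, 0, 10)])
print(find_column(segment, 7))   # (4096, -5, 90)
print(find_column(segment, 3))   # None
```

Keeping entries at a fixed width is what makes the position arithmetic in `find_column` possible; variable-width filters would need a different lookup scheme.
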