Re: [PR] Use docs from parquet-format as source of truth [parquet-site]

via GitHub Wed, 03 Dec 2025 12:40:14 -0800


alamb commented on code in PR #142:
URL: https://github.com/apache/parquet-site/pull/142#discussion_r2586492741



##########
content/en/docs/File Format/Data Pages/compression.md:
##########
@@ -1,83 +1,6 @@
 ---
-title: "Compression"
 linkTitle: "Compression"
 weight: 1
 ---
-## Overview

Review Comment:
   I rendered this locally and it looks good to me
   
   <img width="1301" height="1187" alt="Screenshot 2025-12-03 at 3 18 18 PM" src="https://github.com/user-attachments/assets/17bdf29c-b42b-4f16-b65e-ba42c0563bdc" />
   



##########
hugo.toml:
##########
@@ -171,6 +171,7 @@ desc = "Parquet specification"
 
 
 [module]
+

Review Comment:
   nit: the changes in this file seem unnecessary



##########
content/en/docs/File Format/Types/VariantEncoding.md:
##########
@@ -0,0 +1,7 @@
+---

Review Comment:
   It is really nice to have Variant on the webpage now
   
   <img width="1360" height="1168" alt="Screenshot 2025-12-03 at 3 21 03 PM" src="https://github.com/user-attachments/assets/2f0c91de-f39c-46ed-b46f-1fcb48c5ca49" />
   



##########
static/docs/file-format/pageindex/src/main/thrift/parquet.thrift:
##########
@@ -0,0 +1,15 @@
+<!DOCTYPE html>

Review Comment:
   This is pretty clever. It might be worth adding some comments explaining what it does.



##########
content/en/docs/File Format/pageindex.md:
##########
@@ -1,85 +1,6 @@
 ---
-title: "Page Index"
 linkTitle: "Page Index"
 weight: 7
 ---
-This document describes the format for column index pages in the Parquet
-footer. These pages contain statistics for DataPages and can be used to skip
-pages when scanning data in ordered and unordered columns.
 
-## Problem Statement
-In previous versions of the format, Statistics are stored for ColumnChunks in
-ColumnMetaData and for individual pages inside DataPageHeader structs. When
-reading pages, a reader had to process the page header to determine
-whether the page could be skipped based on the statistics. This means the reader
-had to access all pages in a column, thus likely reading most of the column
-data from disk.
-
-## Goals
-1. Make both range scans and point lookups I/O efficient by allowing direct
-   access to pages based on their min and max values. In particular:
-    *  A single-row lookup in a row group based on the sort column of that row group
-  will only read one data page per the retrieved column.
-    * Range scans on the sort column will only need to read the exact data 
-      pages that contain relevant data.
-    * Make other selective scans I/O efficient: if we have a very selective
-      predicate on a non-sorting column, for the other retrieved columns we
-      should only need to access data pages that contain matching rows.
-2. No additional decoding effort for scans without selective predicates, e.g.,
-   full-row group scans. If a reader determines that it does not need to read 
-   the index data, it does not incur any overhead.
-3. Index pages for sorted columns use minimal storage by storing only the
-   boundary elements between pages.
-
-## Non-Goals
-* Support for the equivalent of secondary indices, i.e., an index structure
-  sorted on the key values over non-sorted data.
-
-
-## Technical Approach
-
-We add two new per-column structures to the row group metadata:
-* ColumnIndex: this allows navigation to the pages of a column based on column
-  values and is used to locate data pages that contain matching values for a
-  scan predicate
-* OffsetIndex: this allows navigation by row index and is used to retrieve
-  values for rows identified as matches via the ColumnIndex. Once rows of a
-  column are skipped, the corresponding rows in the other columns have to be
-  skipped. Hence the OffsetIndexes for each column in a RowGroup are stored
-  together.
-
-The new index structures are stored separately from RowGroup, near the footer.
-This is done so that a reader does not have to pay the I/O and deserialization
-cost for reading them if it is not doing selective scans. The index structures'
-location and length are stored in ColumnChunk.
-
- ![Page Index Layout](/images/PageIndexLayout.png)

Review Comment:
   The image from https://github.com/apache/parquet-format/blob/master/PageIndex.md#technical-approach doesn't seem to be visible anymore:
   
   <img width="971" height="458" alt="Screenshot 2025-12-03 at 3 34 57 PM" src="https://github.com/user-attachments/assets/255e92a2-80b5-4e4d-a501-b34aebeb43ab" />
   
   
   
   The image appears to be there in `public/images/PageIndexLayout.png`, but the rendered link is `doc/images/PageIndexLayout.png`.
   
   <img width="1578" height="841" alt="Screenshot 2025-12-03 at 3 38 22 PM" src="https://github.com/user-attachments/assets/35b112a2-4a18-44f3-a62d-02a35c44c628" />
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
