This is an automated email from the ASF dual-hosted git repository.
shangxinli pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/parquet-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 156f6c5 source/documentation/latest: update links
new 53010ca Merge pull request #7 from kevinburkesegment/update-links
156f6c5 is described below
commit 156f6c55cf1fd83a9f30f332b8fbec6e79813b9b
Author: Kevin Burke <[email protected]>
AuthorDate: Wed Jan 5 14:09:34 2022 -0800
source/documentation/latest: update links
The compatibility library has not been kept up to date and the Rust
library previously linked to a read-only repository. Update both.
I am using an M1 Mac and needed to update the ffi dependency to get
middleman to work correctly; I can revert that change or commit it
separately.
---
Gemfile.lock | 4 +-
output/documentation/latest/index.html | 92 +++++++++++++++++-----------------
source/documentation/latest.html.md | 92 +++++++++++++++++-----------------
3 files changed, 94 insertions(+), 94 deletions(-)
diff --git a/Gemfile.lock b/Gemfile.lock
index 0504300..1553ade 100644
--- a/Gemfile.lock
+++ b/Gemfile.lock
@@ -30,7 +30,7 @@ GEM
execjs (2.7.0)
fast_blank (1.0.0)
fastimage (2.1.7)
- ffi (1.11.1)
+ ffi (1.15.4)
haml (5.1.2)
temple (>= 0.8.0)
tilt
@@ -136,4 +136,4 @@ DEPENDENCIES
redcarpet!
BUNDLED WITH
- 2.0.2
+ 2.3.4
diff --git a/output/documentation/latest/index.html b/output/documentation/latest/index.html
index 84181bd..c496e22 100644
--- a/output/documentation/latest/index.html
+++ b/output/documentation/latest/index.html
@@ -146,9 +146,9 @@
<p>The <a href="https://github.com/apache/parquet-cpp">parquet-cpp</a> project is a C++ library to read-write Parquet files.</p>
-<p>The <a href="https://github.com/sunchao/parquet-rs">parquet-rs</a> project is a Rust library to read-write Parquet files.</p>
+<p>The <a href="https://github.com/apache/arrow-rs/tree/master/parquet">parquet-rs</a> project is a Rust library to read-write Parquet files.</p>
-<p>The <a href="https://github.com/Parquet/parquet-compatibility">parquet-compatibility</a> project contains compatibility tests that can be used to verify that implementations in different languages can read and write each other’s files.</p>
+<p>The <a href="https://github.com/Parquet/parquet-compatibility">parquet-compatibility</a> project (deprecated) contains compatibility tests that can be used to verify that implementations in different languages can read and write each other’s files. As of January 2022 compatibility tests only exist up to version 1.2.0.</p>
<h2 id="building">Building</h2>
@@ -165,8 +165,8 @@
<h2 id="glossary">Glossary</h2>
<ul>
-<li><p>Block (hdfs block): This means a block in hdfs and the meaning is
-unchanged for describing this file format. The file format is
+<li><p>Block (hdfs block): This means a block in hdfs and the meaning is
+unchanged for describing this file format. The file format is
designed to work well on top of hdfs.</p></li>
<li><p>File: A hdfs file that must include the metadata for the file.
It does not need to actually contain the data.</p></li>
@@ -182,7 +182,7 @@ be multiple page types which is interleaved in a column chunk.</p></li>
<p>Hierarchically, a file consists of one or more row groups. A row group
contains exactly one column chunk per column. Column chunks contain one or
-more pages. </p>
+more pages.</p>
<h2 id="unit-of-parallelization">Unit of parallelization</h2>
@@ -213,14 +213,14 @@ File Metadata
4-byte length in bytes of file metadata
4-byte magic number "PAR1"
</code></pre></div>
-<p>In the above example, there are N columns in this table, split into M row
-groups. The file metadata contains the locations of all the column metadata
-start locations. More details on what is contained in the metadata can be found
+<p>In the above example, there are N columns in this table, split into M row
+groups. The file metadata contains the locations of all the column metadata
+start locations. More details on what is contained in the metadata can be found
in the thrift files.</p>
<p>Metadata is written after the data to allow for single pass writing.</p>
-<p>Readers are expected to first read the file metadata to find all the column
+<p>Readers are expected to first read the file metadata to find all the column
chunks they are interested in. The columns chunks should then be read
sequentially.</p>
<p><img src="https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif" alt="File Layout" /></p>
@@ -263,31 +263,31 @@ documented in
<h2 id="nested-encoding">Nested Encoding</h2>
-<p>To encode nested columns, Parquet uses the Dremel encoding with definition and
-repetition levels. Definition levels specify how many optional fields in the
+<p>To encode nested columns, Parquet uses the Dremel encoding with definition and
+repetition levels. Definition levels specify how many optional fields in the
path for the column are defined. Repetition levels specify at what repeated field
in the path has the value repeated. The max definition and repetition levels can
be computed from the schema (i.e. how much nesting there is). This defines the
maximum number of bits required to store the levels (levels are defined for all
-values in the column). </p>
+values in the column).</p>
<p>Two encodings for the levels are supported BIT<em>PACKED and RLE. Only RLE is now used as it supersedes BIT</em>PACKED.</p>
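A rough sketch of how the max levels fall out of the schema (the field names and this list representation are hypothetical, not a real Parquet API): each optional or repeated field on a column's path raises the max definition level by one, and each repeated field also raises the max repetition level by one.

    # Toy path for one nested column, purely to illustrate the rule.
    path = [("links", "optional"), ("backward", "repeated")]

    max_def = sum(1 for _, rep in path if rep in ("optional", "repeated"))
    max_rep = sum(1 for _, rep in path if rep == "repeated")
    print(max_def, max_rep)  # -> 2 1: levels fit in 2 bits and 1 bit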
<h2 id="nulls">Nulls</h2>
-<p>Nullity is encoded in the definition levels (which is run-length encoded). NULL values
-are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs
+<p>Nullity is encoded in the definition levels (which is run-length encoded). NULL values
+are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs
would be encoded with run-length encoding (0, 1000 times) for the definition levels and
-nothing else. </p>
+nothing else.</p>
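The 1000-NULL example reduces to a single run-length pair. A sketch of the idea only; Parquet's actual on-disk format is an RLE/bit-packing hybrid, not Python tuples:

    from itertools import groupby

    def runs(levels):
        return [(value, len(list(group))) for value, group in groupby(levels)]

    # definition level 0 means NULL for a non-nested optional column
    print(runs([0] * 1000))  # -> [(0, 1000)], and no value data at all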
<h2 id="data-pages">Data Pages</h2>
<p>For data pages, the 3 pieces of information are encoded back to back, after the page
-header. We have the </p>
+header. We have the</p>
<ul>
-<li>definition levels data,<br></li>
-<li>repetition levels data, </li>
+<li>definition levels data,</li>
+<li>repetition levels data,</li>
<li>encoded values.
The size of specified in the header is for all 3 pieces combined.</li>
</ul>
@@ -296,7 +296,7 @@ The size of specified in the header is for all 3 pieces combined.</li>
are optional, based on the schema definition. If the column is not nested (i.e.
the path to the column has length 1), we do not encode the repetition levels (it would
always have the value 1). For data that is required, the definition levels are
-skipped (if encoded, it will always have the value of the max definition level). </p>
+skipped (if encoded, it will always have the value of the max definition level).</p>
<p>For example, in the case where the column is non-nested and required, the data in the
page is only the encoded values.</p>
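Those rules can be restated as a small decision table. A hedged sketch; the function and its inputs are invented for illustration, not drawn from any Parquet library:

    def page_pieces(path_length, required):
        pieces = []
        if path_length > 1:     # nested: repetition levels are encoded
            pieces.append("repetition levels")
        if not required:        # optional: definition levels are encoded
            pieces.append("definition levels")
        pieces.append("encoded values")  # always present
        return pieces

    print(page_pieces(path_length=1, required=True))   # -> ['encoded values']
    print(page_pieces(path_length=3, required=False))  # -> all three pieces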
@@ -305,52 +305,52 @@ page is only the encoded values.</p>
<h2 id="column-chunks">Column chunks</h2>
-<p>Column chunks are composed of pages written back to back. The pages share a common
-header and readers can skip over page they are not interested in. The data for the
-page follows the header and can be compressed and/or encoded. The compression and
+<p>Column chunks are composed of pages written back to back. The pages share a common
+header and readers can skip over page they are not interested in. The data for the
+page follows the header and can be compressed and/or encoded. The compression and
encoding is specified in the page metadata.</p>
<h2 id="checksumming">Checksumming</h2>
-<p>Data pages can be individually checksummed. This allows disabling of checksums at the
+<p>Data pages can be individually checksummed. This allows disabling of checksums at the
HDFS file level, to better support single row lookups.</p>
<h2 id="error-recovery">Error recovery</h2>
-<p>If the file metadata is corrupt, the file is lost. If the column metdata is corrupt,
-that column chunk is lost (but column chunks for this column in other row groups are
-okay). If a page header is corrupt, the remaining pages in that chunk are lost. If
-the data within a page is corrupt, that page is lost. The file will be more
+<p>If the file metadata is corrupt, the file is lost. If the column metdata is corrupt,
+that column chunk is lost (but column chunks for this column in other row groups are
+okay). If a page header is corrupt, the remaining pages in that chunk are lost. If
+the data within a page is corrupt, that page is lost. The file will be more
resilient to corruption with smaller row groups.</p>
-<p>Potential extension: With smaller row groups, the biggest issue is placing the file
-metadata at the end. If an error happens while writing the file metadata, all the
-data written will be unreadable. This can be fixed by writing the file metadata
-every Nth row group.<br>
-Each file metadata would be cumulative and include all the row groups written so
-far. Combining this with the strategy used for rc or avro files using sync markers,
-a reader could recover partially written files. </p>
+<p>Potential extension: With smaller row groups, the biggest issue is placing the file
+metadata at the end. If an error happens while writing the file metadata, all the
+data written will be unreadable. This can be fixed by writing the file metadata
+every Nth row group.
+Each file metadata would be cumulative and include all the row groups written so
+far. Combining this with the strategy used for rc or avro files using sync markers,
+a reader could recover partially written files.</p>
<h2 id="separating-metadata-and-column-data">Separating metadata and column
data.</h2>
<p>The format is explicitly designed to separate the metadata from the data.
This
allows splitting columns into multiple files, as well as having a single
metadata
-file reference multiple parquet files. </p>
+file reference multiple parquet files.</p>
<h2 id="configurations">Configurations</h2>
<ul>
-<li>Row group size: Larger row groups allow for larger column chunks which makes it
-possible to do larger sequential IO. Larger groups also require more buffering in
-the write path (or a two pass write). We recommend large row groups (512MB - 1GB).
-Since an entire row group might need to be read, we want it to completely fit on
-one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An
-optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block
+<li>Row group size: Larger row groups allow for larger column chunks which makes it
+possible to do larger sequential IO. Larger groups also require more buffering in
+the write path (or a two pass write). We recommend large row groups (512MB - 1GB).
+Since an entire row group might need to be read, we want it to completely fit on
+one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An
+optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block
per HDFS file.</li>
-<li>Data page size: Data pages should be considered indivisible so smaller data pages
-allow for more fine grained reading (e.g. single row lookup). Larger page sizes
-incur less space overhead (less page headers) and potentially less parsing overhead
-(processing headers). Note: for sequential scans, it is not expected to read a page
+<li>Data page size: Data pages should be considered indivisible so smaller data pages
+allow for more fine grained reading (e.g. single row lookup). Larger page sizes
+incur less space overhead (less page headers) and potentially less parsing overhead
+(processing headers). Note: for sequential scans, it is not expected to read a page
at a time; this is not the IO chunk. We recommend 8KB for page sizes.</li>
</ul>
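As a sketch of applying these recommendations with pyarrow (an assumption on my part; the page names no particular library, and note that pyarrow counts row_group_size in rows rather than bytes, so it must be sized toward the 512MB-1GB byte guidance for real data):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": list(range(1_000_000))})
    pq.write_table(
        table,
        "tuned.parquet",            # placeholder path
        row_group_size=1_000_000,   # rows per group; one large row group here
        data_page_size=8 * 1024,    # ~8KB pages, per the recommendation above
    )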
@@ -360,7 +360,7 @@ at a time; this is not the IO chunk. We recommend 8KB for page sizes.</li>
<ul>
<li>File Version: The file metadata contains a version.</li>
-<li>Encodings: Encodings are specified by enum and more can be added in the future.<br></li>
+<li>Encodings: Encodings are specified by enum and more can be added in the future.</li>
<li>Page types: Additional page types can be added and safely skipped.</li>
</ul>
diff --git a/source/documentation/latest.html.md b/source/documentation/latest.html.md
index ad3a6b6..b9df579 100644
--- a/source/documentation/latest.html.md
+++ b/source/documentation/latest.html.md
@@ -16,9 +16,9 @@ The [parquet-mr](https://github.com/apache/parquet-mr) project contains multiple
The [parquet-cpp](https://github.com/apache/parquet-cpp) project is a C++ library to read-write Parquet files.
-The [parquet-rs](https://github.com/sunchao/parquet-rs) project is a Rust library to read-write Parquet files.
+The [parquet-rs](https://github.com/apache/arrow-rs/tree/master/parquet) project is a Rust library to read-write Parquet files.
-The [parquet-compatibility](https://github.com/Parquet/parquet-compatibility) project contains compatibility tests that can be used to verify that implementations in different languages can read and write each other's files.
+The [parquet-compatibility](https://github.com/Parquet/parquet-compatibility) project (deprecated) contains compatibility tests that can be used to verify that implementations in different languages can read and write each other's files. As of January 2022 compatibility tests only exist up to version 1.2.0.
## Building
@@ -35,8 +35,8 @@ See [How to Release][how-to-release].
[how-to-release]: ../how-to-release/
## Glossary
- - Block (hdfs block): This means a block in hdfs and the meaning is
- unchanged for describing this file format. The file format is
+ - Block (hdfs block): This means a block in hdfs and the meaning is
+ unchanged for describing this file format. The file format is
designed to work well on top of hdfs.
- File: A hdfs file that must include the metadata for the file.
@@ -55,7 +55,7 @@ See [How to Release][how-to-release].
Hierarchically, a file consists of one or more row groups. A row group
contains exactly one column chunk per column. Column chunks contain one or
-more pages.
+more pages.
## Unit of parallelization
- MapReduce - File/Row Group
@@ -83,14 +83,14 @@ This file and the thrift definition should be read together to understand the fo
4-byte length in bytes of file metadata
4-byte magic number "PAR1"
-In the above example, there are N columns in this table, split into M row
-groups. The file metadata contains the locations of all the column metadata
-start locations. More details on what is contained in the metadata can be found
+In the above example, there are N columns in this table, split into M row
+groups. The file metadata contains the locations of all the column metadata
+start locations. More details on what is contained in the metadata can be found
in the thrift files.
Metadata is written after the data to allow for single pass writing.
-Readers are expected to first read the file metadata to find all the column
+Readers are expected to first read the file metadata to find all the column
chunks they are interested in. The columns chunks should then be read
sequentially.

@@ -129,28 +129,28 @@ documented in
[logical-types]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
## Nested Encoding
-To encode nested columns, Parquet uses the Dremel encoding with definition and
-repetition levels. Definition levels specify how many optional fields in the
+To encode nested columns, Parquet uses the Dremel encoding with definition and
+repetition levels. Definition levels specify how many optional fields in the
path for the column are defined. Repetition levels specify at what repeated field
in the path has the value repeated. The max definition and repetition levels can
be computed from the schema (i.e. how much nesting there is). This defines the
maximum number of bits required to store the levels (levels are defined for all
-values in the column).
+values in the column).
Two encodings for the levels are supported BIT_PACKED and RLE. Only RLE is now
used as it supersedes BIT_PACKED.
## Nulls
-Nullity is encoded in the definition levels (which is run-length encoded). NULL values
-are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs
+Nullity is encoded in the definition levels (which is run-length encoded). NULL values
+are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs
would be encoded with run-length encoding (0, 1000 times) for the definition levels and
-nothing else.
+nothing else.
## Data Pages
For data pages, the 3 pieces of information are encoded back to back, after the page
-header. We have the
+header. We have the
- - definition levels data,
- - repetition levels data,
+ - definition levels data,
+ - repetition levels data,
- encoded values.
The size of specified in the header is for all 3 pieces combined.
@@ -158,7 +158,7 @@ The data for the data page is always required. The definition and repetition le
are optional, based on the schema definition. If the column is not nested (i.e.
the path to the column has length 1), we do not encode the repetition levels (it would
always have the value 1). For data that is required, the definition levels are
-skipped (if encoded, it will always have the value of the max definition level).
+skipped (if encoded, it will always have the value of the max definition level).
For example, in the case where the column is non-nested and required, the data in the
page is only the encoded values.
@@ -166,52 +166,52 @@ page is only the encoded values.
The supported encodings are described in
[Encodings.md](https://github.com/apache/parquet-format/blob/master/Encodings.md)
## Column chunks
-Column chunks are composed of pages written back to back. The pages share a common
-header and readers can skip over page they are not interested in. The data for the
-page follows the header and can be compressed and/or encoded. The compression and
+Column chunks are composed of pages written back to back. The pages share a common
+header and readers can skip over page they are not interested in. The data for the
+page follows the header and can be compressed and/or encoded. The compression and
encoding is specified in the page metadata.
## Checksumming
-Data pages can be individually checksummed. This allows disabling of checksums at the
+Data pages can be individually checksummed. This allows disabling of checksums at the
HDFS file level, to better support single row lookups.
## Error recovery
-If the file metadata is corrupt, the file is lost. If the column metdata is corrupt,
-that column chunk is lost (but column chunks for this column in other row groups are
-okay). If a page header is corrupt, the remaining pages in that chunk are lost. If
-the data within a page is corrupt, that page is lost. The file will be more
+If the file metadata is corrupt, the file is lost. If the column metdata is corrupt,
+that column chunk is lost (but column chunks for this column in other row groups are
+okay). If a page header is corrupt, the remaining pages in that chunk are lost. If
+the data within a page is corrupt, that page is lost. The file will be more
resilient to corruption with smaller row groups.
-Potential extension: With smaller row groups, the biggest issue is placing the file
-metadata at the end. If an error happens while writing the file metadata, all the
-data written will be unreadable. This can be fixed by writing the file metadata
-every Nth row group.
-Each file metadata would be cumulative and include all the row groups written so
-far. Combining this with the strategy used for rc or avro files using sync markers,
-a reader could recover partially written files.
+Potential extension: With smaller row groups, the biggest issue is placing the file
+metadata at the end. If an error happens while writing the file metadata, all the
+data written will be unreadable. This can be fixed by writing the file metadata
+every Nth row group.
+Each file metadata would be cumulative and include all the row groups written so
+far. Combining this with the strategy used for rc or avro files using sync markers,
+a reader could recover partially written files.
## Separating metadata and column data.
The format is explicitly designed to separate the metadata from the data. This
allows splitting columns into multiple files, as well as having a single metadata
-file reference multiple parquet files.
+file reference multiple parquet files.
## Configurations
-- Row group size: Larger row groups allow for larger column chunks which makes it
-possible to do larger sequential IO. Larger groups also require more buffering in
-the write path (or a two pass write). We recommend large row groups (512MB - 1GB).
-Since an entire row group might need to be read, we want it to completely fit on
-one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An
-optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block
+- Row group size: Larger row groups allow for larger column chunks which makes it
+possible to do larger sequential IO. Larger groups also require more buffering in
+the write path (or a two pass write). We recommend large row groups (512MB - 1GB).
+Since an entire row group might need to be read, we want it to completely fit on
+one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An
+optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block
per HDFS file.
-- Data page size: Data pages should be considered indivisible so smaller data pages
-allow for more fine grained reading (e.g. single row lookup). Larger page sizes
-incur less space overhead (less page headers) and potentially less parsing overhead
-(processing headers). Note: for sequential scans, it is not expected to read a page
+- Data page size: Data pages should be considered indivisible so smaller data pages
+allow for more fine grained reading (e.g. single row lookup). Larger page sizes
+incur less space overhead (less page headers) and potentially less parsing overhead
+(processing headers). Note: for sequential scans, it is not expected to read a page
at a time; this is not the IO chunk. We recommend 8KB for page sizes.
## Extensibility
There are many places in the format for compatible extensions:
- File Version: The file metadata contains a version.
-- Encodings: Encodings are specified by enum and more can be added in the future.
+- Encodings: Encodings are specified by enum and more can be added in the future.
- Page types: Additional page types can be added and safely skipped.