This is an automated email from the ASF dual-hosted git repository.
shangxinli pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/parquet-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 156f6c5 source/documentation/latest: update links
new 53010ca Merge pull request #7 from kevinburkesegment/update-links
156f6c5 is described below
commit 156f6c55cf1fd83a9f30f332b8fbec6e79813b9b
Author: Kevin Burke <[email protected]>
AuthorDate: Wed Jan 5 14:09:34 2022 -0800
source/documentation/latest: update links
The compatibility library has not been kept up to date and the Rust
library previously linked to a read-only repository. Update both.
I am using an M1 Mac and needed to update the ffi dependency to get
middleman to work correctly; I can revert that change or commit it
separately.
---
Gemfile.lock | 4 +-
output/documentation/latest/index.html | 92 +++++++++++++++++-----------------
source/documentation/latest.html.md | 92 +++++++++++++++++-----------------
3 files changed, 94 insertions(+), 94 deletions(-)
diff --git a/Gemfile.lock b/Gemfile.lock
index 0504300..1553ade 100644
--- a/Gemfile.lock
+++ b/Gemfile.lock
@@ -30,7 +30,7 @@ GEM
execjs (2.7.0)
fast_blank (1.0.0)
fastimage (2.1.7)
- ffi (1.11.1)
+ ffi (1.15.4)
haml (5.1.2)
temple (>= 0.8.0)
tilt
@@ -136,4 +136,4 @@ DEPENDENCIES
redcarpet!
BUNDLED WITH
- 2.0.2
+ 2.3.4
diff --git a/output/documentation/latest/index.html b/output/documentation/latest/index.html
index 84181bd..c496e22 100644
--- a/output/documentation/latest/index.html
+++ b/output/documentation/latest/index.html
@@ -146,9 +146,9 @@
<p>The <a href="https://github.com/apache/parquet-cpp">parquet-cpp</a> project is a C++ library to read-write Parquet files.</p>
-<p>The <a href="https://github.com/sunchao/parquet-rs">parquet-rs</a> project is a Rust library to read-write Parquet files.</p>
+<p>The <a href="https://github.com/apache/arrow-rs/tree/master/parquet">parquet-rs</a> project is a Rust library to read-write Parquet files.</p>
-<p>The <a href="https://github.com/Parquet/parquet-compatibility">parquet-compatibility</a> project contains compatibility tests that can be used to verify that implementations in different languages can read and write each other’s files.</p>
+<p>The <a href="https://github.com/Parquet/parquet-compatibility">parquet-compatibility</a> project (deprecated) contains compatibility tests that can be used to verify that implementations in different languages can read and write each other’s files. As of January 2022 compatibility tests only exist up to version 1.2.0.</p>
<h2 id="building">Building</h2>
@@ -165,8 +165,8 @@
<h2 id="glossary">Glossary</h2>
<ul>
-<li><p>Block (hdfs block): This means a block in hdfs and the meaning is
-unchanged for describing this file format. The file format is
+<li><p>Block (hdfs block): This means a block in hdfs and the meaning is
+unchanged for describing this file format. The file format is
designed to work well on top of hdfs.</p></li>
<li><p>File: A hdfs file that must include the metadata for the file.
It does not need to actually contain the data.</p></li>
@@ -182,7 +182,7 @@ be multiple page types which is interleaved in a column chunk.</p></li>
<p>Hierarchically, a file consists of one or more row groups. A row group
contains exactly one column chunk per column. Column chunks contain one or
-more pages. </p>
+more pages.</p>
<h2 id="unit-of-parallelization">Unit of parallelization</h2>
@@ -213,14 +213,14 @@ File Metadata
4-byte length in bytes of file metadata
4-byte magic number "PAR1"
</code></pre></div>
-<p>In the above example, there are N columns in this table, split into M row
-groups. The file metadata contains the locations of all the column metadata
-start locations. More details on what is contained in the metadata can be found
+<p>In the above example, there are N columns in this table, split into M row
+groups. The file metadata contains the locations of all the column metadata
+start locations. More details on what is contained in the metadata can be found
in the thrift files.</p>
<p>Metadata is written after the data to allow for single pass writing.</p>
-<p>Readers are expected to first read the file metadata to find all the column
+<p>Readers are expected to first read the file metadata to find all the column
chunks they are interested in. The columns chunks should then be read
sequentially.</p>
<p><img src="https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif" alt="File Layout" /></p>
@@ -263,31 +263,31 @@ documented in
<h2 id="nested-encoding">Nested Encoding</h2>
-<p>To encode nested columns, Parquet uses the Dremel encoding with definition and
-repetition levels. Definition levels specify how many optional fields in the
+<p>To encode nested columns, Parquet uses the Dremel encoding with definition and
+repetition levels. Definition levels specify how many optional fields in the
path for the column are defined. Repetition levels specify at what repeated field
in the path has the value repeated. The max definition and repetition levels can
be computed from the schema (i.e. how much nesting there is). This defines the
maximum number of bits required to store the levels (levels are defined for all
-values in the column). </p>
+values in the column).</p>
<p>Two encodings for the levels are supported BIT<em>PACKED and RLE. Only RLE is now used as it supersedes BIT</em>PACKED.</p>
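A rough sketch of how the max levels fall out of the schema (the field names and this list representation are hypothetical, not a real Parquet API): each optional or repeated field on a column's path raises the max definition level by one, and each repeated field also raises the max repetition level by one.

    # Toy path for one nested column, purely to illustrate the rule.
    path = [("links", "optional"), ("backward", "repeated")]

    max_def = sum(1 for _, rep in path if rep in ("optional", "repeated"))
    max_rep = sum(1 for _, rep in path if rep == "repeated")
    print(max_def, max_rep)  # -> 2 1: levels fit in 2 bits and 1 bit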
<h2 id="nulls">Nulls</h2>
-<p>Nullity is encoded in the definition levels (which is run-length encoded). NULL values
-are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs
+<p>Nullity is encoded in the definition levels (which is run-length encoded). NULL values
+are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs
would be encoded with run-length encoding (0, 1000 times) for the definition levels and
-nothing else. </p>
+nothing else.</p>
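The 1000-NULL example reduces to a single run-length pair. A sketch of the idea only; Parquet's actual on-disk format is an RLE/bit-packing hybrid, not Python tuples:

    from itertools import groupby

    def runs(levels):
        return [(value, len(list(group))) for value, group in groupby(levels)]

    # definition level 0 means NULL for a non-nested optional column
    print(runs([0] * 1000))  # -> [(0, 1000)], and no value data at all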
<h2 id="data-pages">Data Pages</h2>
<p>For data pages, the 3 pieces of information are encoded back to back, after the page
-header. We have the </p>
+header. We have the</p>
<ul>
-<li>definition levels data,<br></li>
-<li>repetition levels data, </li>
+<li>definition levels data,</li>
+<li>repetition levels data,</li>
<li>encoded values.
The size of specified in the header is for all 3 pieces combined.</li>
</ul>
@@ -296,7 +296,7 @@ The size of specified in the header is for all 3 pieces combined.</li>
are optional, based on the schema definition. If the column is not nested (i.e.
the path to the column has length 1), we do not encode the repetition levels (it would
always have the value 1). For data that is required, the definition levels are
-skipped (if encoded, it will always have the value of the max definition level). </p>
+skipped (if encoded, it will always have the value of the max definition level).</p>
<p>For example, in the case where the column is non-nested and required, the data in the
page is only the encoded values.</p>
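Those rules can be restated as a small decision table. A hedged sketch; the function and its inputs are invented for illustration, not drawn from any Parquet library:

    def page_pieces(path_length, required):
        pieces = []
        if path_length > 1:     # nested: repetition levels are encoded
            pieces.append("repetition levels")
        if not required:        # optional: definition levels are encoded
            pieces.append("definition levels")
        pieces.append("encoded values")  # always present
        return pieces

    print(page_pieces(path_length=1, required=True))   # -> ['encoded values']
    print(page_pieces(path_length=3, required=False))  # -> all three pieces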
@@ -305,52 +305,52 @@ page is only the encoded values.</p>
<h2 id="column-chunks">Column chunks</h2>
-<p>Column chunks are composed of pages written back to back. The pages share a common
-header and readers can skip over page they are not interested in. The data for the
-page follows the header and can be compressed and/or encoded. The compression and
+<p>Column chunks are composed of pages written back to back. The pages share a common
+header and readers can skip over page they are not interested in. The data for the
+page follows the header and can be compressed and/or encoded. The compression and
encoding is specified in the page metadata.</p>
<h2 id="checksumming">Checksumming</h2>
-<p>Data pages can be individually checksummed. This allows disabling of checksums at the
+<p>Data pages can be individually checksummed. This allows disabling of checksums at the
HDFS file level, to better support single row lookups.</p>
<h2 id="error-recovery">Error recovery</h2>
-<p>If the file metadata is corrupt, the file is lost. If the column metdata is corrupt,
-that column chunk is lost (but column chunks for this column in other row groups are
-okay). If a page header is corrupt, the remaining pages in that chunk are lost. If
-the data within a page is corrupt, that page is lost. The file will be more
+<p>If the file metadata is corrupt, the file is lost. If the column metdata is corrupt,
+that column chunk is lost (but column chunks for this column in other row groups are
+okay). If a page header is corrupt, the remaining pages in that chunk are lost. If
+the data within a page is corrupt, that page is lost. The file will be more
resilient to corruption with smaller row groups.</p>
-<p>Potential extension: With smaller row groups, the biggest issue is placing the file
-metadata at the end. If an error happens while writing the file metadata, all the
-data written will be unreadable. This can be fixed by writing the file metadata
-every Nth row group.<br>
-Each file metadata would be cumulative and include all the row groups written so
-far. Combining this with the strategy used for rc or avro files using sync markers,
-a reader could recover partially written files. </p>
+<p>Potential extension: With smaller row groups, the biggest issue is placing the file
+metadata at the end. If an error happens while writing the file metadata, all the
+data written will be unreadable. This can be fixed by writing the file metadata
+every Nth row group.
+Each file metadata would be cumulative and include all the row groups written so
+far. Combining this with the strategy used for rc or avro files using sync markers,
+a reader could recover partially written files.</p>
<h2 id="separating-metadata-and-column-data">Separating metadata and column
data.</h2>
<p>The format is explicitly designed to separate the metadata from the data.
This
allows splitting columns into multiple files, as well as having a single
metadata
-file reference multiple parquet files. </p>
+file reference multiple parquet files.</p>
<h2 id="configurations">Configurations</h2>
<ul>
-<li>Row group size: Larger row groups allow for larger column chunks which makes it
-possible to do larger sequential IO. Larger groups also require more buffering in
-the write path (or a two pass write). We recommend large row groups (512MB - 1GB).
-Since an entire row group might need to be read, we want it to completely fit on
-one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An
-optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block
+<li>Row group size: Larger row groups allow for larger column chunks which makes it
+possible to do larger sequential IO. Larger groups also require more buffering in
+the write path (or a two pass write). We recommend large row groups (512MB - 1GB).
+Since an entire row group might need to be read, we want it to completely fit on
+one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An
+optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block
per HDFS file.</li>
-<li>Data page size: Data pages should be considered indivisible so smaller data pages
-allow for more fine grained reading (e.g. single row lookup). Larger page sizes
-incur less space overhead (less page headers) and potentially less parsing overhead
-(processing headers). Note: for sequential scans, it is not expected to read a page
+<li>Data page size: Data pages should be considered indivisible so smaller data pages
+allow for more fine grained reading (e.g. single row lookup). Larger page sizes
+incur less space overhead (less page headers) and potentially less parsing overhead
+(processing headers). Note: for sequential scans, it is not expected to read a page
at a time; this is not the IO chunk. We recommend 8KB for page sizes.</li>
</ul>
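As a sketch of applying these recommendations with pyarrow (an assumption on my part; the page names no particular library, and note that pyarrow counts row_group_size in rows rather than bytes, so it must be sized toward the 512MB-1GB byte guidance for real data):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": list(range(1_000_000))})
    pq.write_table(
        table,
        "tuned.parquet",            # placeholder path
        row_group_size=1_000_000,   # rows per group; one large row group here
        data_page_size=8 * 1024,    # ~8KB pages, per the recommendation above
    )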
@@ -360,7 +360,7 @@ at a time; this is not the IO chunk. We recommend 8KB for page sizes.</li>
<ul>
<li>File Version: The file metadata contains a version.</li>
-<li>Encodings: Encodings are specified by enum and more can be added in the future.<br></li>
+<li>Encodings: Encodings are specified by enum and more can be added in the future.</li>
<li>Page types: Additional page types can be added and safely skipped.</li>
</ul>
diff --git a/source/documentation/latest.html.md b/source/documentation/latest.html.md
index ad3a6b6..b9df579 100644
--- a/source/documentation/latest.html.md
+++ b/source/documentation/latest.html.md
@@ -16,9 +16,9 @@ The [parquet-mr](https://github.com/apache/parquet-mr) project contains multiple
The [parquet-cpp](https://github.com/apache/parquet-cpp) project is a C++ library to read-write Parquet files.
-The [parquet-rs](https://github.com/sunchao/parquet-rs) project is a Rust library to read-write Parquet files.
+The [parquet-rs](https://github.com/apache/arrow-rs/tree/master/parquet) project is a Rust library to read-write Parquet files.
-The [parquet-compatibility](https://github.com/Parquet/parquet-compatibility) project contains compatibility tests that can be used to verify that implementations in different languages can read and write each other's files.
+The [parquet-compatibility](https://github.com/Parquet/parquet-compatibility) project (deprecated) contains compatibility tests that can be used to verify that implementations in different languages can read and write each other's files. As of January 2022 compatibility tests only exist up to version 1.2.0.
## Building
@@ -35,8 +35,8 @@ See [How to Release][how-to-release].
[how-to-release]: ../how-to-release/
## Glossary
- - Block (hdfs block): This means a block in hdfs and the meaning is
- unchanged for describing this file format. The file format is
+ - Block (hdfs block): This means a block in hdfs and the meaning is
+ unchanged for describing this file format. The file format is
designed to work well on top of hdfs.
- File: A hdfs file that must include the metadata for the file.
@@ -55,7 +55,7 @@ See [How to Release][how-to-release].
Hierarchically, a file consists of one or more row groups. A row group
contains exactly one column chunk per column. Column chunks contain one or
-more pages.
+more pages.
## Unit of parallelization
- MapReduce - File/Row Group
@@ -83,14 +83,14 @@ This file and the thrift definition should be read together to understand the fo
4-byte length in bytes of file metadata
4-byte magic number "PAR1"
-In the above example, there are N columns in this table, split into M row
-groups. The file metadata contains the locations of all the column metadata
-start locations. More details on what is contained in the metadata can be found
+In the above example, there are N columns in this table, split into M row
+groups. The file metadata contains the locations of all the column metadata
+start locations. More details on what is contained in the metadata can be found
in the thrift files.
Metadata is written after the data to allow for single pass writing.
-Readers are expected to first read the file metadata to find all the column
+Readers are expected to first read the file metadata to find all the column
chunks they are interested in. The columns chunks should then be read
sequentially.

@@ -129,28 +129,28 @@ documented in
[logical-types]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
## Nested Encoding
-To encode nested columns, Parquet uses the Dremel encoding with definition and
-repetition levels. Definition levels specify how many optional fields in the
+To encode nested columns, Parquet uses the Dremel encoding with definition and
+repetition levels. Definition levels specify how many optional fields in the
path for the column are defined. Repetition levels specify at what repeated field
in the path has the value repeated. The max definition and repetition levels can
be computed from the schema (i.e. how much nesting there is). This defines the
maximum number of bits required to store the levels (levels are defined for all
-values in the column).
+values in the column).
Two encodings for the levels are supported BIT_PACKED and RLE. Only RLE is now
used as it supersedes BIT_PACKED.
## Nulls
-Nullity is encoded in the definition levels (which is run-length encoded). NULL values
-are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs
+Nullity is encoded in the definition levels (which is run-length encoded). NULL values
+are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs
would be encoded with run-length encoding (0, 1000 times) for the definition levels and
-nothing else.
+nothing else.
## Data Pages
For data pages, the 3 pieces of information are encoded back to back, after the page
-header. We have the
+header. We have the
- - definition levels data,
- - repetition levels data,
+ - definition levels data,
+ - repetition levels data,
- encoded values.
The size of specified in the header is for all 3 pieces combined.
@@ -158,7 +158,7 @@ The data for the data page is always required. The definition and repetition le
are optional, based on the schema definition. If the column is not nested (i.e.
the path to the column has length 1), we do not encode the repetition levels (it would
always have the value 1). For data that is required, the definition levels are
-skipped (if encoded, it will always have the value of the max definition level).
+skipped (if encoded, it will always have the value of the max definition level).
For example, in the case where the column is non-nested and required, the data in the
page is only the encoded values.
@@ -166,52 +166,52 @@ page is only the encoded values.
The supported encodings are described in
[Encodings.md](https://github.com/apache/parquet-format/blob/master/Encodings.md)
## Column chunks
-Column chunks are composed of pages written back to back. The pages share a common
-header and readers can skip over page they are not interested in. The data for the
-page follows the header and can be compressed and/or encoded. The compression and
+Column chunks are composed of pages written back to back. The pages share a common
+header and readers can skip over page they are not interested in. The data for the
+page follows the header and can be compressed and/or encoded. The compression and
encoding is specified in the page metadata.
## Checksumming
-Data pages can be individually checksummed. This allows disabling of checksums at the
+Data pages can be individually checksummed. This allows disabling of checksums at the
HDFS file level, to better support single row lookups.
## Error recovery
-If the file metadata is corrupt, the file is lost. If the column metdata is corrupt,
-that column chunk is lost (but column chunks for this column in other row groups are
-okay). If a page header is corrupt, the remaining pages in that chunk are lost. If
-the data within a page is corrupt, that page is lost. The file will be more
+If the file metadata is corrupt, the file is lost. If the column metdata is corrupt,
+that column chunk is lost (but column chunks for this column in other row groups are
+okay). If a page header is corrupt, the remaining pages in that chunk are lost. If
+the data within a page is corrupt, that page is lost. The file will be more
resilient to corruption with smaller row groups.
-Potential extension: With smaller row groups, the biggest issue is placing the file
-metadata at the end. If an error happens while writing the file metadata, all the
-data written will be unreadable. This can be fixed by writing the file metadata
-every Nth row group.
-Each file metadata would be cumulative and include all the row groups written so
-far. Combining this with the strategy used for rc or avro files using sync markers,
-a reader could recover partially written files.
+Potential extension: With smaller row groups, the biggest issue is placing the file
+metadata at the end. If an error happens while writing the file metadata, all the
+data written will be unreadable. This can be fixed by writing the file metadata
+every Nth row group.
+Each file metadata would be cumulative and include all the row groups written so
+far. Combining this with the strategy used for rc or avro files using sync markers,
+a reader could recover partially written files.
## Separating metadata and column data.
The format is explicitly designed to separate the metadata from the data. This
allows splitting columns into multiple files, as well as having a single metadata
-file reference multiple parquet files.
+file reference multiple parquet files.
## Configurations
-- Row group size: Larger row groups allow for larger column chunks which makes it
-possible to do larger sequential IO. Larger groups also require more buffering in
-the write path (or a two pass write). We recommend large row groups (512MB - 1GB).
-Since an entire row group might need to be read, we want it to completely fit on
-one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An
-optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block
+- Row group size: Larger row groups allow for larger column chunks which makes it
+possible to do larger sequential IO. Larger groups also require more buffering in
+the write path (or a two pass write). We recommend large row groups (512MB - 1GB).
+Since an entire row group might need to be read, we want it to completely fit on
+one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An
+optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block
per HDFS file.
-- Data page size: Data pages should be considered indivisible so smaller data pages
-allow for more fine grained reading (e.g. single row lookup). Larger page sizes
-incur less space overhead (less page headers) and potentially less parsing overhead
-(processing headers). Note: for sequential scans, it is not expected to read a page
+- Data page size: Data pages should be considered indivisible so smaller data pages
+allow for more fine grained reading (e.g. single row lookup). Larger page sizes
+incur less space overhead (less page headers) and potentially less parsing overhead
+(processing headers). Note: for sequential scans, it is not expected to read a page
at a time; this is not the IO chunk. We recommend 8KB for page sizes.
## Extensibility
There are many places in the format for compatible extensions:
- File Version: The file metadata contains a version.
-- Encodings: Encodings are specified by enum and more can be added in the future.
+- Encodings: Encodings are specified by enum and more can be added in the future.
- Page types: Additional page types can be added and safely skipped.