This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/parquet-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new f0a4c6a deploy: e79b30489c6bd50f0829a5f2b87f4a26f5e4af05
f0a4c6a is described below
commit f0a4c6ae94e382933d6ee77d12e7121d439bfd78
Author: Fokko <[email protected]>
AuthorDate: Mon Mar 11 21:11:51 2024 +0000
deploy: e79b30489c6bd50f0829a5f2b87f4a26f5e4af05
---
output/docs/_print/index.html | 26 +++++++++++-----------
.../docs/contribution-guidelines/_print/index.html | 6 ++---
.../contributing/index.html | 8 +++----
output/docs/contribution-guidelines/index.xml | 6 ++---
.../contribution-guidelines/releasing/index.html | 10 ++++-----
output/docs/file-format/_print/index.html | 20 ++++++++---------
output/docs/file-format/bloomfilter/index.html | 12 +++++-----
.../docs/file-format/data-pages/_print/index.html | 12 +++++-----
.../file-format/data-pages/compression/index.html | 10 ++++-----
.../file-format/data-pages/encodings/index.html | 10 ++++-----
.../file-format/data-pages/encryption/index.html | 8 +++----
output/docs/file-format/data-pages/index.xml | 12 +++++-----
output/docs/file-format/index.xml | 8 +++----
output/docs/index.xml | 26 +++++++++++-----------
output/sitemap.xml | 2 +-
15 files changed, 88 insertions(+), 88 deletions(-)
diff --git a/output/docs/_print/index.html b/output/docs/_print/index.html
index 1d0ee42..139db0c 100644
--- a/output/docs/_print/index.html
+++ b/output/docs/_print/index.html
@@ -169,7 +169,7 @@ chosen as follows:</p><div class=highlight><pre tabindex=0
style=background-colo
</span></span><span style=display:flex><span><span
style=color:#204a87;font-weight:700>unsigned</span> <span
style=color:#000>int64</span> <span style=color:#000>z_as_64_bit</span> <span
style=color:#ce5c00;font-weight:700>=</span> <span
style=color:#000>z</span><span style=color:#000;font-weight:700>;</span>
</span></span><span style=display:flex><span><span
style=color:#204a87;font-weight:700>unsigned</span> <span
style=color:#000>int32</span> <span style=color:#000>i</span> <span
style=color:#ce5c00;font-weight:700>=</span> <span
style=color:#000;font-weight:700>(</span><span
style=color:#000>h_top_bits</span> <span
style=color:#ce5c00;font-weight:700>*</span> <span
style=color:#000>z_as_64_bit</span><span
style=color:#000;font-weight:700>)</span> <span
style=color:#ce5c00;font-weight:700> [...]
</span></span></code></pre></div><p>The first line extracts the most
significant 32 bits from <code>h</code> and
-assignes them to a 64-bit unsigned integer. The second line is
+assigns them to a 64-bit unsigned integer. The second line is
simpler: it just sets an unsigned 64-bit value to the same value as
the 32-bit unsigned value <code>z</code>. The purpose of having both
<code>h_top_bits</code>
and <code>z_as_64_bit</code> be 64-bit values is so that their product is a
@@ -200,7 +200,7 @@ significant 32 bits.</p><pre tabindex=0><code>void
filter_insert(SBBF filter, un
block b = filter.getBlock(i);
return block_check(b, (unsigned int32)x)
}
-</code></pre><p>The use of blocks is from Putze et al.’s <a
href=http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf>Cache-,
Hash- and
+</code></pre><p>The use of blocks is from Putze et al.’s <a
href=https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf>Cache-,
Hash- and
Space-Efficient Bloom
filters</a></p><p>To use an SBBF for values of arbitrary Parquet types, we
apply a hash
function to that value - at the time of writing,
@@ -208,12 +208,12 @@ function to that value - at the time of writing,
with a seed of 0 and <a
href=https://github.com/Cyan4973/xxHash/blob/v0.7.0/doc/xxhash_spec.md>following
the specification version
0.1.1</a>.</p><h4 id=sizing-an-sbbf>Sizing an SBBF</h4><p>The
<code>check</code> operation in SBBFs can return <code>true</code> for an
argument that
was never inserted into the SBBF. These are called “false
-positives”. The “false positive probabilty” is the
probability that
+positives”. The “false positive probability” is the
probability that
any given hash value that was never <code>insert</code>ed into the SBBF will
cause <code>check</code> to return <code>true</code> (a false positive). There
is not a
simple closed-form calculation of this probability, but here is an
example:</p><p>A filter that uses 1024 blocks and has had 26,214 hash values
-<code>insert</code>ed will have a false positive probabilty of around 1.26%.
Each
+<code>insert</code>ed will have a false positive probability of around 1.26%.
Each
of those 1024 blocks occupies 256 bits of space, so the total space
usage is 262,144. That means that the ratio of bits of space to hash
values is 10-to-1. Adding more hash values increases the denominator
@@ -314,7 +314,7 @@ If any ambiguity arises when implementing this format, the
implementation
provided by the <a href=https://zlib.net/>zlib compression library</a> is
authoritative.</p><p>Readers should support reading pages containing multiple
GZIP members, however,
as this has historically not been supported by all implementations, it is
recommended
that writers refrain from creating such pages by default for better
interoperability.</p><h3 id=lzo>LZO</h3><p>A codec based on or interoperable
with the
-<a href=http://www.oberhumer.com/opensource/lzo/>LZO compression
library</a>.</p><h3 id=brotli>BROTLI</h3><p>A codec based on the Brotli format
defined by
+<a href=https://www.oberhumer.com/opensource/lzo/>LZO compression
library</a>.</p><h3 id=brotli>BROTLI</h3><p>A codec based on the Brotli format
defined by
<a href=https://tools.ietf.org/html/rfc7932>RFC 7932</a>.
If any ambiguity arises when implementing this format, the implementation
provided by the <a href=https://github.com/google/brotli>Brotli compression
library</a>
@@ -326,10 +326,10 @@ this compression codec in their user-facing APIs, and
advise users to
switch to the newer, interoperable <code>LZ4_RAW</code> codec.</p><h3
id=zstd>ZSTD</h3><p>A codec based on the Zstandard format defined by
<a href=https://tools.ietf.org/html/rfc8478>RFC 8478</a>. If any ambiguity
arises
when implementing this format, the implementation provided by the
-<a href=https://facebook.github.io/zstd/>ZStandard compression library</a>
+<a href=https://facebook.github.io/zstd/>Zstandard compression library</a>
is authoritative.</p><h3 id=lz4_raw>LZ4_RAW</h3><p>A codec based on the <a
href=https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md>LZ4 block
format</a>.
If any ambiguity arises when implementing this format, the implementation
-provided by the <a href=http://www.lz4.org/>LZ4 compression library</a> is
authoritative.</p></div><div class=td-content
style=page-break-before:always><h1 id=pg-9aa971e751fdd370158d525ad337ef7a>3.7.2
- Encodings</h1><p><a name=PLAIN></a></p><h3 id=plain-plain--0>Plain: (PLAIN =
0)</h3><p>Supported Types: all</p><p>This is the plain encoding that must be
supported for types. It is
+provided by the <a href=https://www.lz4.org/>LZ4 compression library</a> is
authoritative.</p></div><div class=td-content
style=page-break-before:always><h1 id=pg-9aa971e751fdd370158d525ad337ef7a>3.7.2
- Encodings</h1><p><a name=PLAIN></a></p><h3 id=plain-plain--0>Plain: (PLAIN =
0)</h3><p>Supported Types: all</p><p>This is the plain encoding that must be
supported for types. It is
intended to be the simplest encoding. Values are encoded back to
back.</p><p>The plain encoding is used whenever a more efficient encoding can
not be used. It
stores the data in the following format:</p><ul><li>BOOLEAN: <a
href=/docs/file-format/data-pages/encodings/#BITPACKED>Bit Packed</a>, LSB
first</li><li>INT32: 4 bytes little endian</li><li>INT64: 8 bytes little
endian</li><li>INT96: 12 bytes little endian (deprecated)</li><li>FLOAT: 4
bytes IEEE little endian</li><li>DOUBLE: 8 bytes IEEE little
endian</li><li>BYTE_ARRAY: length in 4 bytes little endian followed by the
bytes contained in the array</li><li>FIXED_LEN_BYTE_ARRAY: the bytes [...]
point types are encoded in IEEE.</p><p>For the byte array type, it encodes the
length as a 4 byte little
@@ -397,7 +397,7 @@ bit label: ABC DEF GHI JKL MNO PQR STU VWX
bit label: ABCDEFGH IJKLMNOP QRSTUVWX
</code></pre><p>Note that the BIT_PACKED encoding method is only supported for
encoding
repetition and definition levels.</p><p><a name=DELTAENC></a></p><h3
id=delta-encoding-delta_binary_packed--5>Delta Encoding (DELTA_BINARY_PACKED =
5)</h3><p>Supported Types: INT32, INT64</p><p>This encoding is adapted from the
Binary packing described in
-<a href=http://arxiv.org/pdf/1209.2137v5.pdf>“Decoding billions of
integers per second through vectorization”</a>
+<a href=https://arxiv.org/pdf/1209.2137v5.pdf>“Decoding billions of
integers per second through vectorization”</a>
by D. Lemire and L. Boytsov.</p><p>In delta encoding we make use of variable
length integers for storing various
numbers (not the deltas themselves). For unsigned values, we use ULEB128,
which is the unsigned version of LEB128 (<a
href=https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)>https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)</a>.
@@ -409,7 +409,7 @@ quotient, the number of values in a miniblock, is a
multiple of 32; it is
stored as a ULEB128 int</li><li>the total value count is stored as a ULEB128
int</li><li>the first value is stored as a zigzag ULEB128 int</li></ul><p>Each
block contains</p><pre tabindex=0><code><min delta> <list of bitwidths
of miniblocks> <miniblocks>
</code></pre><ul><li>the min delta is a zigzag ULEB128 int (we compute a
minimum as we need
positive integers for bit packing)</li><li>the bitwidth of each block is
stored as a byte</li><li>each miniblock is a list of bit packed ints according
to the bit width
-stored at the begining of the block</li></ul><p>To encode a block, we
will:</p><ol><li><p>Compute the differences between consecutive elements. For
the first
+stored at the beginning of the block</li></ul><p>To encode a block, we
will:</p><ol><li><p>Compute the differences between consecutive elements. For
the first
element in the block, use the last element in the previous block or, in
the case of the first block, use the first value of the whole sequence,
stored in the header.</p></li><li><p>Compute the frame of reference (the
minimum of the deltas in the block).
@@ -558,7 +558,7 @@ data set (table). This string is optionally passed by a
writer upon file creatio
the AAD prefix is stored in an <code>aad_prefix</code> field in the file, and
is made available to the readers.
This field is not encrypted. If a user is concerned about keeping the file
identity inside the file,
the writer code can explicitly request Parquet not to store the AAD prefix.
Then the aad_prefix field
-will be empty; AAD prefixes must be fully managed by the caller code and
supplied explictly to Parquet
+will be empty; AAD prefixes must be fully managed by the caller code and
supplied explicitly to Parquet
readers for each file.</p><p>The protection against swapping full files is
optional. It is not enabled by default because
it requires the writers to generate and pass an AAD prefix.</p><p>A reader of
a file created with an AAD prefix, should be able to verify the prefix (file
identity)
by comparing it with e.g. the target table name, using a convention accepted
in the organization.
@@ -807,7 +807,7 @@ indices, and page offsets to scan in each column. The
reader can then
initialize a scanner for each column and fast forward them to the start row of
the scan.</p><p>The <code>min_values</code> and <code>max_values</code> are
calculated based on the <code>column_orders</code>
field in the <code>FileMetaData</code> struct of the footer.</p></div><div
class=td-content style=page-break-before:always><h1
id=pg-68f9113b50693620ee9892f038d4139b>4 - Developer Guide</h1><div
class=lead>All developer resources related to Parquet.</div><p>This section
contains the developer specific documentation related to Parquet.</p></div><div
class=td-content><h1 id=pg-f674d9b9519c822d1a39419a8e9167a5>4.1 -
Modules</h1><p>The <a href=https://github.com/apache/parquet-format>parquet
[...]
-Java resources can be build using <code>mvn package</code>. The current stable
version should always be available from Maven Central.</p><p>C++ thrift
resources can be generated via make.</p><p>Thrift can be also code-genned into
any other thrift-supported language.</p></div><div class=td-content
style=page-break-before:always><h1 id=pg-47cac26307c77b16f1b9e75c1e46efec>4.3 -
Contributing to Parquet</h1><div class=lead>How to contribute to
Parquet</div><h2 id=pull-requests>Pull Requests</ [...]
+Java resources can be build using <code>mvn package</code>. The current stable
version should always be available from Maven Central.</p><p>C++ thrift
resources can be generated via make.</p><p>Thrift can be also code-genned into
any other thrift-supported language.</p></div><div class=td-content
style=page-break-before:always><h1 id=pg-47cac26307c77b16f1b9e75c1e46efec>4.3 -
Contributing to Parquet</h1><div class=lead>How to contribute to
Parquet</div><h2 id=pull-requests>Pull Requests</ [...]
git remote add apache https://gitbox.apache.org/repos/asf?p=parquet-mr.git
</code></pre><p>run the following command</p><pre><code>dev/merge_parquet_pr.py
</code></pre><p>example output:</p><pre><code>Which pull request would you
like to merge? (e.g. 34):
@@ -852,7 +852,7 @@ Would you like to pick 485658a5 into another branch? (y/n):
</code></pre><p>For now just say <code>n</code> as we have 1 branch</p><h2
id=website>Website</h2><h3 id=release-documentation>Release
Documentation</h3><p>To create documentation for a new release of
<code>parquet-format</code> create a new <releasenumber>.md file under
<code>content/en/blog/parquet-format</code>. Please see existing files in that
directory as an example.</p><p>To create documentation for a new release of
<code>parquet-mr</code> create a new <releasenumber>.md file unde [...]
job in the <a
href=https://github.com/apache/parquet-site/blob/staging/.github/workflows/deploy.yml>deployment
workflow</a> will be run, populating the <code>asf-staging</code> branch on
this repo with the necessary files.</li></ol><p><strong>Do not directly edit
the <code>asf-staging</code> branch of this repo</strong></p><h4
id=production>Production</h4><p>To make a change to the <code>production</code>
version of the website:</p><ol><li>Make a PR against the
<code>production</code> br [...]
job in the <a
href=https://github.com/apache/parquet-site/blob/production/.github/workflows/deploy.yml>deployment
workflow</a> will be run, populating the <code>asf-site</code> branch on this
repo with the necessary files.</li></ol><p><strong>Do not directly edit the
<code>asf-site</code> branch of this repo</strong></p></div><div
class=td-content style=page-break-before:always><h1
id=pg-d65ca0c6c1ffbeb20627c4d33e7e3dc4>4.4 - Releasing Parquet</h1><div
class=lead>How to release Parquet</ [...]
-</code></pre><p>If you have problems, read the <a
href=https://www.apache.org/dev/publishing-maven-artifacts.html>publishing
Maven artifacts documentation</a></p><h3 id=release-process>Release
process</h3><p>Parquet uses the maven-release-plugin to tag a release and push
binary artifacts to staging in Nexus. Once maven completes the release, the
offical source tarball is built from the tag.</p><p>Before you start the
release process:</p><ol><li>Verify that the release is finished (no pla [...]
+</code></pre><p>If you have problems, read the <a
href=https://www.apache.org/dev/publishing-maven-artifacts.html>publishing
Maven artifacts documentation</a></p><h3 id=release-process>Release
process</h3><p>Parquet uses the maven-release-plugin to tag a release and push
binary artifacts to staging in Nexus. Once maven completes the release, the
official source tarball is built from the tag.</p><p>Before you start the
release process:</p><ol><li>Verify that the release is finished (no pl [...]
</code></pre><p>This runs maven’s release prepare with a consistent tag name.
After this step, the release tag will exist in the git repository.</p><p>If
this step fails, you can roll back the changes by running these
commands.</p><pre><code>find ./ -type f -name '*.releaseBackup' -exec rm {} \;
find ./ -type f -name 'pom.xml' -exec git checkout {} \;
</code></pre><h4 id=2-run-releaseperform-to-stage-binaries>2. Run
release:perform to stage binaries</h4><pre><code>mvn release:perform
@@ -895,7 +895,7 @@ svn co https://dist.apache.org/repos/dist/release/parquet
releases
</code></pre><p>Then add and commit the release artifacts:</p><pre><code>cd
releases
svn add apache-parquet-<version>
svn ci -m "Parquet: Add release <VERSION>"
-</code></pre><h4 id=4-update-parquetapacheorg>4. Update
parquet.apache.org</h4><p>Update the downloads page on parquet.apache.org.
Instructions for updating the site are on the <a
href=http://parquet.apache.org/docs/contribution-guidelines/contributing/>contribution
page</a>.</p><h4
id=5-send-an-announce-e-mail-to-announceapacheorgmailtoannounceapacheorg-and-the-dev-list>5.
Send an ANNOUNCE e-mail to <a
href=mailto:announce@apache.org>announce@apache.org</a> and the dev
list</h4><pre><co [...]
+</code></pre><h4 id=4-update-parquetapacheorg>4. Update
parquet.apache.org</h4><p>Update the downloads page on parquet.apache.org.
Instructions for updating the site are on the <a
href=https://parquet.apache.org/docs/contribution-guidelines/contributing/>contribution
page</a>.</p><h4
id=5-send-an-announce-e-mail-to-announceapacheorgmailtoannounceapacheorg-and-the-dev-list>5.
Send an ANNOUNCE e-mail to <a
href=mailto:announce@apache.org>announce@apache.org</a> and the dev
list</h4><pre><c [...]
I'm please to announce the release of Parquet <VERSION>!
diff --git a/output/docs/contribution-guidelines/_print/index.html
b/output/docs/contribution-guidelines/_print/index.html
index fd18d19..e8ec557 100644
--- a/output/docs/contribution-guidelines/_print/index.html
+++ b/output/docs/contribution-guidelines/_print/index.html
@@ -5,7 +5,7 @@
"><meta name=twitter:card content="summary"><meta name=twitter:title
content="Developer Guide"><meta name=twitter:description content="All developer
resources related to Parquet.
"><link rel=preload
href=/scss/main.min.c57c6f762acece9f1b5c78f58a507b2e5ff04d2c4c951e6678debf7d71d25341.css
as=style><link
href=/scss/main.min.c57c6f762acece9f1b5c78f58a507b2e5ff04d2c4c951e6678debf7d71d25341.css
rel=stylesheet integrity><script
src=https://code.jquery.com/jquery-3.6.3.min.js
integrity="sha512-STof4xm1wgkfm7heWqFJVn58Hm3EtS31XFaagaa8VMReCXAkQnJZ+jEy8PCC/iT18dFy95WcExNHFTqLyp72eQ=="
crossorigin=anonymous></script><link rel=stylesheet
href=https://cdn.jsdelivr.net/npm/@doc [...]
<a href=# onclick="return print(),!1">Click here to print</a>.</p><p><a
href=/docs/contribution-guidelines/>Return to the regular view of this
page</a>.</p></div><h1 class=title>Developer Guide</h1><div class=lead>All
developer resources related to Parquet.</div><ul><li>1: <a
href=#pg-f674d9b9519c822d1a39419a8e9167a5>Modules</a></li><li>2: <a
href=#pg-0fc7677c5a8dcd5250334bbf678cb165>Building Parquet</a></li><li>3: <a
href=#pg-47cac26307c77b16f1b9e75c1e46efec>Contributing to Parquet</a>< [...]
-Java resources can be build using <code>mvn package</code>. The current stable
version should always be available from Maven Central.</p><p>C++ thrift
resources can be generated via make.</p><p>Thrift can be also code-genned into
any other thrift-supported language.</p></div><div class=td-content
style=page-break-before:always><h1 id=pg-47cac26307c77b16f1b9e75c1e46efec>3 -
Contributing to Parquet</h1><div class=lead>How to contribute to
Parquet</div><h2 id=pull-requests>Pull Requests</h2 [...]
+Java resources can be build using <code>mvn package</code>. The current stable
version should always be available from Maven Central.</p><p>C++ thrift
resources can be generated via make.</p><p>Thrift can be also code-genned into
any other thrift-supported language.</p></div><div class=td-content
style=page-break-before:always><h1 id=pg-47cac26307c77b16f1b9e75c1e46efec>3 -
Contributing to Parquet</h1><div class=lead>How to contribute to
Parquet</div><h2 id=pull-requests>Pull Requests</h2 [...]
git remote add apache https://gitbox.apache.org/repos/asf?p=parquet-mr.git
</code></pre><p>run the following command</p><pre><code>dev/merge_parquet_pr.py
</code></pre><p>example output:</p><pre><code>Which pull request would you
like to merge? (e.g. 34):
@@ -50,7 +50,7 @@ Would you like to pick 485658a5 into another branch? (y/n):
</code></pre><p>For now just say <code>n</code> as we have 1 branch</p><h2
id=website>Website</h2><h3 id=release-documentation>Release
Documentation</h3><p>To create documentation for a new release of
<code>parquet-format</code> create a new <releasenumber>.md file under
<code>content/en/blog/parquet-format</code>. Please see existing files in that
directory as an example.</p><p>To create documentation for a new release of
<code>parquet-mr</code> create a new <releasenumber>.md file unde [...]
job in the <a
href=https://github.com/apache/parquet-site/blob/staging/.github/workflows/deploy.yml>deployment
workflow</a> will be run, populating the <code>asf-staging</code> branch on
this repo with the necessary files.</li></ol><p><strong>Do not directly edit
the <code>asf-staging</code> branch of this repo</strong></p><h4
id=production>Production</h4><p>To make a change to the <code>production</code>
version of the website:</p><ol><li>Make a PR against the
<code>production</code> br [...]
job in the <a
href=https://github.com/apache/parquet-site/blob/production/.github/workflows/deploy.yml>deployment
workflow</a> will be run, populating the <code>asf-site</code> branch on this
repo with the necessary files.</li></ol><p><strong>Do not directly edit the
<code>asf-site</code> branch of this repo</strong></p></div><div
class=td-content style=page-break-before:always><h1
id=pg-d65ca0c6c1ffbeb20627c4d33e7e3dc4>4 - Releasing Parquet</h1><div
class=lead>How to release Parquet</di [...]
-</code></pre><p>If you have problems, read the <a
href=https://www.apache.org/dev/publishing-maven-artifacts.html>publishing
Maven artifacts documentation</a></p><h3 id=release-process>Release
process</h3><p>Parquet uses the maven-release-plugin to tag a release and push
binary artifacts to staging in Nexus. Once maven completes the release, the
offical source tarball is built from the tag.</p><p>Before you start the
release process:</p><ol><li>Verify that the release is finished (no pla [...]
+</code></pre><p>If you have problems, read the <a
href=https://www.apache.org/dev/publishing-maven-artifacts.html>publishing
Maven artifacts documentation</a></p><h3 id=release-process>Release
process</h3><p>Parquet uses the maven-release-plugin to tag a release and push
binary artifacts to staging in Nexus. Once maven completes the release, the
official source tarball is built from the tag.</p><p>Before you start the
release process:</p><ol><li>Verify that the release is finished (no pl [...]
</code></pre><p>This runs maven’s release prepare with a consistent tag name.
After this step, the release tag will exist in the git repository.</p><p>If
this step fails, you can roll back the changes by running these
commands.</p><pre><code>find ./ -type f -name '*.releaseBackup' -exec rm {} \;
find ./ -type f -name 'pom.xml' -exec git checkout {} \;
</code></pre><h4 id=2-run-releaseperform-to-stage-binaries>2. Run
release:perform to stage binaries</h4><pre><code>mvn release:perform
@@ -93,7 +93,7 @@ svn co https://dist.apache.org/repos/dist/release/parquet
releases
</code></pre><p>Then add and commit the release artifacts:</p><pre><code>cd
releases
svn add apache-parquet-<version>
svn ci -m "Parquet: Add release <VERSION>"
-</code></pre><h4 id=4-update-parquetapacheorg>4. Update
parquet.apache.org</h4><p>Update the downloads page on parquet.apache.org.
Instructions for updating the site are on the <a
href=http://parquet.apache.org/docs/contribution-guidelines/contributing/>contribution
page</a>.</p><h4
id=5-send-an-announce-e-mail-to-announceapacheorgmailtoannounceapacheorg-and-the-dev-list>5.
Send an ANNOUNCE e-mail to <a
href=mailto:announce@apache.org>announce@apache.org</a> and the dev
list</h4><pre><co [...]
+</code></pre><h4 id=4-update-parquetapacheorg>4. Update
parquet.apache.org</h4><p>Update the downloads page on parquet.apache.org.
Instructions for updating the site are on the <a
href=https://parquet.apache.org/docs/contribution-guidelines/contributing/>contribution
page</a>.</p><h4
id=5-send-an-announce-e-mail-to-announceapacheorgmailtoannounceapacheorg-and-the-dev-list>5.
Send an ANNOUNCE e-mail to <a
href=mailto:announce@apache.org>announce@apache.org</a> and the dev
list</h4><pre><c [...]
I'm please to announce the release of Parquet <VERSION>!
diff --git a/output/docs/contribution-guidelines/contributing/index.html
b/output/docs/contribution-guidelines/contributing/index.html
index f435000..d3ac8b7 100644
--- a/output/docs/contribution-guidelines/contributing/index.html
+++ b/output/docs/contribution-guidelines/contributing/index.html
@@ -1,13 +1,13 @@
<!doctype html><html itemscope itemtype=http://schema.org/WebPage lang=en
class=no-js><head><meta charset=utf-8><meta name=viewport
content="width=device-width,initial-scale=1,shrink-to-fit=no"><meta name=robots
content="index, follow"><link rel="shortcut icon"
href=/favicons/favicon.ico><link rel=apple-touch-icon
href=/favicons/apple-touch-icon-180x180.png sizes=180x180><link rel=icon
type=image/png href=/favicons/favicon-16x16.png sizes=16x16><link rel=icon
type=image/png href=/favicon [...]
<meta name=description content="How to contribute to Parquet
"><meta property="og:title" content="Contributing to Parquet"><meta
property="og:description" content="How to contribute to Parquet
-"><meta property="og:type" content="article"><meta property="og:url"
content="/docs/contribution-guidelines/contributing/"><meta
property="article:section" content="docs"><meta
property="article:modified_time" content="2022-07-09T16:12:08-07:00"><meta
property="og:site_name" content="Apache Parquet"><meta itemprop=name
content="Contributing to Parquet"><meta itemprop=description content="How to
contribute to Parquet
-"><meta itemprop=dateModified content="2022-07-09T16:12:08-07:00"><meta
itemprop=wordCount content="680"><meta itemprop=keywords content><meta
name=twitter:card content="summary"><meta name=twitter:title
content="Contributing to Parquet"><meta name=twitter:description content="How
to contribute to Parquet
+"><meta property="og:type" content="article"><meta property="og:url"
content="/docs/contribution-guidelines/contributing/"><meta
property="article:section" content="docs"><meta
property="article:modified_time" content="2024-03-11T22:11:10+01:00"><meta
property="og:site_name" content="Apache Parquet"><meta itemprop=name
content="Contributing to Parquet"><meta itemprop=description content="How to
contribute to Parquet
+"><meta itemprop=dateModified content="2024-03-11T22:11:10+01:00"><meta
itemprop=wordCount content="680"><meta itemprop=keywords content><meta
name=twitter:card content="summary"><meta name=twitter:title
content="Contributing to Parquet"><meta name=twitter:description content="How
to contribute to Parquet
"><link rel=preload
href=/scss/main.min.c57c6f762acece9f1b5c78f58a507b2e5ff04d2c4c951e6678debf7d71d25341.css
as=style><link
href=/scss/main.min.c57c6f762acece9f1b5c78f58a507b2e5ff04d2c4c951e6678debf7d71d25341.css
rel=stylesheet integrity><script
src=https://code.jquery.com/jquery-3.6.3.min.js
integrity="sha512-STof4xm1wgkfm7heWqFJVn58Hm3EtS31XFaagaa8VMReCXAkQnJZ+jEy8PCC/iT18dFy95WcExNHFTqLyp72eQ=="
crossorigin=anonymous></script><link rel=stylesheet
href=https://cdn.jsdelivr.net/npm/@doc [...]
<a
href=https://github.com/apache/parquet-site/edit/production/content/en/docs/Contribution%20Guidelines/contributing.md
class="td-page-meta--edit td-page-meta__edit" target=_blank rel=noopener><i
class="fa-solid fa-pen-to-square fa-fw"></i> Edit this page</a>
<a
href="https://github.com/apache/parquet-site/new/production/content/en/docs/Contribution%20Guidelines?filename=change-me.md&value=---%0Atitle%3A+%22Long+Page+Title%22%0AlinkTitle%3A+%22Short+Nav+Title%22%0Aweight%3A+100%0Adescription%3A+%3E-%0A+++++Page+description+for+heading+and+indexes.%0A---%0A%0A%23%23+Heading%0A%0AEdit+this+template+to+create+your+new+page.%0A%0A%2A+Give+it+a+good+name%2C+ending+in+%60.md%60+-+e.g.+%60getting-started.md%60%0A%2A+Edit+the+%22front+matter%22+s
[...]
<a
href="https://github.com/apache/parquet-site/issues/new?title=Contributing%20to%20Parquet"
class="td-page-meta--issue td-page-meta__issue" target=_blank rel=noopener><i
class="fa-solid fa-list-check fa-fw"></i> Create documentation issue</a>
-<a id=print href=/docs/contribution-guidelines/_print/><i class="fa-solid
fa-print fa-fw"></i> Print entire section</a></div><div class=td-toc><nav
id=TableOfContents><ul><li><a href=#pull-requests>Pull Requests</a></li><li><a
href=#committers>Committers</a></li><li><a href=#website>Website</a><ul><li><a
href=#release-documentation>Release Documentation</a></li><li><a
href=#website-development-and-deployment>Website development and
deployment</a></li></ul></li></ul></nav></div></aside><m [...]
+<a id=print href=/docs/contribution-guidelines/_print/><i class="fa-solid
fa-print fa-fw"></i> Print entire section</a></div><div class=td-toc><nav
id=TableOfContents><ul><li><a href=#pull-requests>Pull Requests</a></li><li><a
href=#committers>Committers</a></li><li><a href=#website>Website</a><ul><li><a
href=#release-documentation>Release Documentation</a></li><li><a
href=#website-development-and-deployment>Website development and
deployment</a></li></ul></li></ul></nav></div></aside><m [...]
git remote add apache https://gitbox.apache.org/repos/asf?p=parquet-mr.git
</code></pre><p>run the following command</p><pre><code>dev/merge_parquet_pr.py
</code></pre><p>example output:</p><pre><code>Which pull request would you
like to merge? (e.g. 34):
@@ -51,6 +51,6 @@ Merge hash: 485658a5
Would you like to pick 485658a5 into another branch? (y/n):
</code></pre><p>For now just say <code>n</code> as we have 1 branch</p><h2
id=website>Website</h2><h3 id=release-documentation>Release
Documentation</h3><p>To create documentation for a new release of
<code>parquet-format</code> create a new <releasenumber>.md file under
<code>content/en/blog/parquet-format</code>. Please see existing files in that
directory as an example.</p><p>To create documentation for a new release of
<code>parquet-mr</code> create a new <releasenumber>.md file unde [...]
job in the <a
href=https://github.com/apache/parquet-site/blob/staging/.github/workflows/deploy.yml>deployment
workflow</a> will be run, populating the <code>asf-staging</code> branch on
this repo with the necessary files.</li></ol><p><strong>Do not directly edit
the <code>asf-staging</code> branch of this repo</strong></p><h4
id=production>Production</h4><p>To make a change to the <code>production</code>
version of the website:</p><ol><li>Make a PR against the
<code>production</code> br [...]
-job in the <a href=https://github.com/apache/parquet-site/blob/production/.github/workflows/deploy.yml>deployment workflow</a> will be run, populating the <code>asf-site</code> branch on this repo with the necessary files.</li></ol><p><strong>Do not directly edit the <code>asf-site</code> branch of this repo</strong></p><div class=td-page-meta__lastmod>Last modified July 9, 2022: <a href=https://github.com/apache/parquet-site/commit/227318cd2ead187eda1cd2cac80b0636c4c3779c>updated stale [...]
+job in the <a href=https://github.com/apache/parquet-site/blob/production/.github/workflows/deploy.yml>deployment workflow</a> will be run, populating the <code>asf-site</code> branch on this repo with the necessary files.</li></ol><p><strong>Do not directly edit the <code>asf-site</code> branch of this repo</strong></p><div class=td-page-meta__lastmod>Last modified March 11, 2024: <a href=https://github.com/apache/parquet-site/commit/e79b30489c6bd50f0829a5f2b87f4a26f5e4af05>Fix typos (# [...]
2024
<span class=td-footer__authors>Apache Parquet</span></span><span
class=td-footer__all_rights_reserved>All Rights Reserved</span><span
class=ms-2><a href=https://policies.google.com/privacy target=_blank
rel=noopener>Privacy Policy</a></span></div></div></div></footer></div><script
src=/js/main.min.1f48fc7981e4db829114650dc98d270b6642a46c1e4ebddb8389ff0a463a6328.js
integrity="sha256-H0j8eYHk24KRFGUNyY0nC2ZCpGweTr3bg4n/CkY6Yyg="
crossorigin=anonymous></script><script defer src=/js/click-to [...]
\ No newline at end of file
diff --git a/output/docs/contribution-guidelines/index.xml
b/output/docs/contribution-guidelines/index.xml
index 06773df..3d1eefa 100644
--- a/output/docs/contribution-guidelines/index.xml
+++ b/output/docs/contribution-guidelines/index.xml
@@ -20,7 +20,7 @@ Java resources can be build using <code>mvn
package</code>. The current st
</ol>
<p>If you’d like to report a bug but don’t have time to fix it, you can
still post it to our <a
href="https://issues.apache.org/jira/browse/PARQUET">issue tracker</a>, or
email the mailing list (<a
href="mailto:[email protected]">[email protected]</a>).</p>
<h2 id="committers">Committers</h2>
-<p>Merging a pull request requires being a comitter on the project.</p>
+<p>Merging a pull request requires being a committer on the project.</p>
<p>How to merge a Pull request (have an apache and github-apache remote
setup):</p>
<pre><code>git remote add github-apache
[email protected]:apache/parquet-mr.git
git remote add apache https://gitbox.apache.org/repos/asf?p=parquet-mr.git
@@ -97,7 +97,7 @@ job in the <a
href="https://github.com/apache/parquet-site/blob/production/.g
</code></pre>
<p>If you have problems, read the <a
href="https://www.apache.org/dev/publishing-maven-artifacts.html">publishing
Maven artifacts documentation</a></p>
<h3 id="release-process">Release process</h3>
-<p>Parquet uses the maven-release-plugin to tag a release and push binary artifacts to staging in Nexus. Once maven completes the release, the offical source tarball is built from the tag.</p>
+<p>Parquet uses the maven-release-plugin to tag a release and push binary artifacts to staging in Nexus. Once maven completes the release, the official source tarball is built from the tag.</p>
<p>Before you start the release process:</p>
<ol>
<li>Verify that the release is finished (no planned JIRAs are pending and
all patches are cherry-picked to the release branch)</li>
@@ -190,7 +190,7 @@ svn add apache-parquet-&lt;version&gt;
svn ci -m &quot;Parquet: Add release &lt;VERSION&gt;&quot;
</code></pre>
<h4 id="4-update-parquetapacheorg">4. Update parquet.apache.org</h4>
-<p>Update the downloads page on parquet.apache.org. Instructions for updating the site are on the <a href="http://parquet.apache.org/docs/contribution-guidelines/contributing/">contribution page</a>.</p>
+<p>Update the downloads page on parquet.apache.org. Instructions for updating the site are on the <a href="https://parquet.apache.org/docs/contribution-guidelines/contributing/">contribution page</a>.</p>
<h4
id="5-send-an-announce-e-mail-to-announceapacheorgmailtoannounceapacheorg-and-the-dev-list">5.
Send an ANNOUNCE e-mail to <a
href="mailto:[email protected]">[email protected]</a> and the dev
list</h4>
<pre><code>[ANNOUNCE] Apache Parquet release &lt;VERSION&gt;
I'm please to announce the release of Parquet &lt;VERSION&gt;!
diff --git a/output/docs/contribution-guidelines/releasing/index.html
b/output/docs/contribution-guidelines/releasing/index.html
index 5dac601..9cfb2b8 100644
--- a/output/docs/contribution-guidelines/releasing/index.html
+++ b/output/docs/contribution-guidelines/releasing/index.html
@@ -1,14 +1,14 @@
<!doctype html><html itemscope itemtype=http://schema.org/WebPage lang=en
class=no-js><head><meta charset=utf-8><meta name=viewport
content="width=device-width,initial-scale=1,shrink-to-fit=no"><meta name=robots
content="index, follow"><link rel="shortcut icon"
href=/favicons/favicon.ico><link rel=apple-touch-icon
href=/favicons/apple-touch-icon-180x180.png sizes=180x180><link rel=icon
type=image/png href=/favicons/favicon-16x16.png sizes=16x16><link rel=icon
type=image/png href=/favicon [...]
<meta name=description content="How to release Parquet
"><meta property="og:title" content="Releasing Parquet"><meta
property="og:description" content="How to release Parquet
-"><meta property="og:type" content="article"><meta property="og:url"
content="/docs/contribution-guidelines/releasing/"><meta
property="article:section" content="docs"><meta
property="article:modified_time" content="2023-04-06T10:48:15+08:00"><meta
property="og:site_name" content="Apache Parquet"><meta itemprop=name
content="Releasing Parquet"><meta itemprop=description content="How to release
Parquet
-"><meta itemprop=dateModified content="2023-04-06T10:48:15+08:00"><meta
itemprop=wordCount content="822"><meta itemprop=keywords content><meta
name=twitter:card content="summary"><meta name=twitter:title content="Releasing
Parquet"><meta name=twitter:description content="How to release Parquet
+"><meta property="og:type" content="article"><meta property="og:url"
content="/docs/contribution-guidelines/releasing/"><meta
property="article:section" content="docs"><meta
property="article:modified_time" content="2024-03-11T22:11:10+01:00"><meta
property="og:site_name" content="Apache Parquet"><meta itemprop=name
content="Releasing Parquet"><meta itemprop=description content="How to release
Parquet
+"><meta itemprop=dateModified content="2024-03-11T22:11:10+01:00"><meta
itemprop=wordCount content="822"><meta itemprop=keywords content><meta
name=twitter:card content="summary"><meta name=twitter:title content="Releasing
Parquet"><meta name=twitter:description content="How to release Parquet
"><link rel=preload
href=/scss/main.min.c57c6f762acece9f1b5c78f58a507b2e5ff04d2c4c951e6678debf7d71d25341.css
as=style><link
href=/scss/main.min.c57c6f762acece9f1b5c78f58a507b2e5ff04d2c4c951e6678debf7d71d25341.css
rel=stylesheet integrity><script
src=https://code.jquery.com/jquery-3.6.3.min.js
integrity="sha512-STof4xm1wgkfm7heWqFJVn58Hm3EtS31XFaagaa8VMReCXAkQnJZ+jEy8PCC/iT18dFy95WcExNHFTqLyp72eQ=="
crossorigin=anonymous></script><link rel=stylesheet
href=https://cdn.jsdelivr.net/npm/@doc [...]
<a
href=https://github.com/apache/parquet-site/edit/production/content/en/docs/Contribution%20Guidelines/releasing.md
class="td-page-meta--edit td-page-meta__edit" target=_blank rel=noopener><i
class="fa-solid fa-pen-to-square fa-fw"></i> Edit this page</a>
<a
href="https://github.com/apache/parquet-site/new/production/content/en/docs/Contribution%20Guidelines?filename=change-me.md&value=---%0Atitle%3A+%22Long+Page+Title%22%0AlinkTitle%3A+%22Short+Nav+Title%22%0Aweight%3A+100%0Adescription%3A+%3E-%0A+++++Page+description+for+heading+and+indexes.%0A---%0A%0A%23%23+Heading%0A%0AEdit+this+template+to+create+your+new+page.%0A%0A%2A+Give+it+a+good+name%2C+ending+in+%60.md%60+-+e.g.+%60getting-started.md%60%0A%2A+Edit+the+%22front+matter%22+s
[...]
<a
href="https://github.com/apache/parquet-site/issues/new?title=Releasing%20Parquet"
class="td-page-meta--issue td-page-meta__issue" target=_blank rel=noopener><i
class="fa-solid fa-list-check fa-fw"></i> Create documentation issue</a>
<a id=print href=/docs/contribution-guidelines/_print/><i class="fa-solid
fa-print fa-fw"></i> Print entire section</a></div><div class=td-toc><nav
id=TableOfContents><ul><li><ul><li><a href=#setup>Setup</a></li><li><a
href=#release-process>Release process</a></li><li><a
href=#publishing-after-the-vote-passes>Publishing after the vote
passes</a></li></ul></li></ul></nav></div></aside><main class="col-12 col-md-9
col-xl-8 ps-md-5" role=main><nav aria-label=breadcrumb class=td-breadcrumbs>
[...]
-</code></pre><p>If you have problems, read the <a href=https://www.apache.org/dev/publishing-maven-artifacts.html>publishing Maven artifacts documentation</a></p><h3 id=release-process>Release process</h3><p>Parquet uses the maven-release-plugin to tag a release and push binary artifacts to staging in Nexus. Once maven completes the release, the offical source tarball is built from the tag.</p><p>Before you start the release process:</p><ol><li>Verify that the release is finished (no pla [...]
+</code></pre><p>If you have problems, read the <a href=https://www.apache.org/dev/publishing-maven-artifacts.html>publishing Maven artifacts documentation</a></p><h3 id=release-process>Release process</h3><p>Parquet uses the maven-release-plugin to tag a release and push binary artifacts to staging in Nexus. Once maven completes the release, the official source tarball is built from the tag.</p><p>Before you start the release process:</p><ol><li>Verify that the release is finished (no pl [...]
</code></pre><p>This runs maven’s release prepare with a consistent tag name.
After this step, the release tag will exist in the git repository.</p><p>If
this step fails, you can roll back the changes by running these
commands.</p><pre><code>find ./ -type f -name '*.releaseBackup' -exec rm {} \;
find ./ -type f -name 'pom.xml' -exec git checkout {} \;
</code></pre><h4 id=2-run-releaseperform-to-stage-binaries>2. Run
release:perform to stage binaries</h4><pre><code>mvn release:perform
@@ -51,7 +51,7 @@ svn co https://dist.apache.org/repos/dist/release/parquet
releases
</code></pre><p>Then add and commit the release artifacts:</p><pre><code>cd
releases
svn add apache-parquet-<version>
svn ci -m "Parquet: Add release <VERSION>"
-</code></pre><h4 id=4-update-parquetapacheorg>4. Update parquet.apache.org</h4><p>Update the downloads page on parquet.apache.org. Instructions for updating the site are on the <a href=http://parquet.apache.org/docs/contribution-guidelines/contributing/>contribution page</a>.</p><h4 id=5-send-an-announce-e-mail-to-announceapacheorgmailtoannounceapacheorg-and-the-dev-list>5. Send an ANNOUNCE e-mail to <a href=mailto:[email protected]>[email protected]</a> and the dev list</h4><pre><co [...]
+</code></pre><h4 id=4-update-parquetapacheorg>4. Update parquet.apache.org</h4><p>Update the downloads page on parquet.apache.org. Instructions for updating the site are on the <a href=https://parquet.apache.org/docs/contribution-guidelines/contributing/>contribution page</a>.</p><h4 id=5-send-an-announce-e-mail-to-announceapacheorgmailtoannounceapacheorg-and-the-dev-list>5. Send an ANNOUNCE e-mail to <a href=mailto:[email protected]>[email protected]</a> and the dev list</h4><pre><c [...]
I'm please to announce the release of Parquet <VERSION>!
@@ -67,6 +67,6 @@ This release can be downloaded from:
https://parquet.apache.org/downloads/
Java artifacts are available from Maven Central.
Thanks to everyone for contributing!
-</code></pre><div class=td-page-meta__lastmod>Last modified April 6, 2023: <a href=https://github.com/apache/parquet-site/commit/aa23de1f6d256449e7e5052c8603291a625d7683>Release 1.13.0 (aa23de1)</a></div></div></main></div></div><footer class="td-footer row d-print-none"><div class=container-fluid><div class="row mx-md-2"><div class="td-footer__left col-6 col-sm-4 order-sm-1"><ul class=td-footer__links-list><li class=td-footer__links-item data-bs-toggle=tooltip title="User mailing list" [...]
+</code></pre><div class=td-page-meta__lastmod>Last modified March 11, 2024: <a href=https://github.com/apache/parquet-site/commit/e79b30489c6bd50f0829a5f2b87f4a26f5e4af05>Fix typos (#46) (e79b304)</a></div></div></main></div></div><footer class="td-footer row d-print-none"><div class=container-fluid><div class="row mx-md-2"><div class="td-footer__left col-6 col-sm-4 order-sm-1"><ul class=td-footer__links-list><li class=td-footer__links-item data-bs-toggle=tooltip title="User mailing list [...]
2024
<span class=td-footer__authors>Apache Parquet</span></span><span
class=td-footer__all_rights_reserved>All Rights Reserved</span><span
class=ms-2><a href=https://policies.google.com/privacy target=_blank
rel=noopener>Privacy Policy</a></span></div></div></div></footer></div><script
src=/js/main.min.1f48fc7981e4db829114650dc98d270b6642a46c1e4ebddb8389ff0a463a6328.js
integrity="sha256-H0j8eYHk24KRFGUNyY0nC2ZCpGweTr3bg4n/CkY6Yyg="
crossorigin=anonymous></script><script defer src=/js/click-to [...]
\ No newline at end of file
diff --git a/output/docs/file-format/_print/index.html
b/output/docs/file-format/_print/index.html
index 91c9e0c..2cdb3a1 100644
--- a/output/docs/file-format/_print/index.html
+++ b/output/docs/file-format/_print/index.html
@@ -163,7 +163,7 @@ chosen as follows:</p><div class=highlight><pre tabindex=0
style=background-colo
</span></span><span style=display:flex><span><span
style=color:#204a87;font-weight:700>unsigned</span> <span
style=color:#000>int64</span> <span style=color:#000>z_as_64_bit</span> <span
style=color:#ce5c00;font-weight:700>=</span> <span
style=color:#000>z</span><span style=color:#000;font-weight:700>;</span>
</span></span><span style=display:flex><span><span
style=color:#204a87;font-weight:700>unsigned</span> <span
style=color:#000>int32</span> <span style=color:#000>i</span> <span
style=color:#ce5c00;font-weight:700>=</span> <span
style=color:#000;font-weight:700>(</span><span
style=color:#000>h_top_bits</span> <span
style=color:#ce5c00;font-weight:700>*</span> <span
style=color:#000>z_as_64_bit</span><span
style=color:#000;font-weight:700>)</span> <span
style=color:#ce5c00;font-weight:700> [...]
</span></span></code></pre></div><p>The first line extracts the most
significant 32 bits from <code>h</code> and
-assignes them to a 64-bit unsigned integer. The second line is
+assigns them to a 64-bit unsigned integer. The second line is
simpler: it just sets an unsigned 64-bit value to the same value as
the 32-bit unsigned value <code>z</code>. The purpose of having both
<code>h_top_bits</code>
and <code>z_as_64_bit</code> be 64-bit values is so that their product is a
@@ -194,7 +194,7 @@ significant 32 bits.</p><pre tabindex=0><code>void
filter_insert(SBBF filter, un
block b = filter.getBlock(i);
return block_check(b, (unsigned int32)x)
}
-</code></pre><p>The use of blocks is from Putze et al.’s <a href=http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf>Cache-, Hash- and
+</code></pre><p>The use of blocks is from Putze et al.’s <a href=https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf>Cache-, Hash- and
Space-Efficient Bloom
filters</a></p><p>To use an SBBF for values of arbitrary Parquet types, we
apply a hash
function to that value - at the time of writing,
@@ -202,12 +202,12 @@ function to that value - at the time of writing,
with a seed of 0 and <a
href=https://github.com/Cyan4973/xxHash/blob/v0.7.0/doc/xxhash_spec.md>following
the specification version
0.1.1</a>.</p><h4 id=sizing-an-sbbf>Sizing an SBBF</h4><p>The
<code>check</code> operation in SBBFs can return <code>true</code> for an
argument that
was never inserted into the SBBF. These are called “false
-positives”. The “false positive probabilty” is the probability that
+positives”. The “false positive probability” is the probability that
any given hash value that was never <code>insert</code>ed into the SBBF will
cause <code>check</code> to return <code>true</code> (a false positive). There
is not a
simple closed-form calculation of this probability, but here is an
example:</p><p>A filter that uses 1024 blocks and has had 26,214 hash values
-<code>insert</code>ed will have a false positive probabilty of around 1.26%. Each
+<code>insert</code>ed will have a false positive probability of around 1.26%. Each
of those 1024 blocks occupies 256 bits of space, so the total space
usage is 262,144. That means that the ratio of bits of space to hash
values is 10-to-1. Adding more hash values increases the denominator
@@ -308,7 +308,7 @@ If any ambiguity arises when implementing this format, the
implementation
provided by the <a href=https://zlib.net/>zlib compression library</a> is
authoritative.</p><p>Readers should support reading pages containing multiple
GZIP members, however,
as this has historically not been supported by all implementations, it is
recommended
that writers refrain from creating such pages by default for better
interoperability.</p><h3 id=lzo>LZO</h3><p>A codec based on or interoperable
with the
-<a href=http://www.oberhumer.com/opensource/lzo/>LZO compression library</a>.</p><h3 id=brotli>BROTLI</h3><p>A codec based on the Brotli format defined by
+<a href=https://www.oberhumer.com/opensource/lzo/>LZO compression library</a>.</p><h3 id=brotli>BROTLI</h3><p>A codec based on the Brotli format defined by
<a href=https://tools.ietf.org/html/rfc7932>RFC 7932</a>.
If any ambiguity arises when implementing this format, the implementation
provided by the <a href=https://github.com/google/brotli>Brotli compression
library</a>
@@ -320,10 +320,10 @@ this compression codec in their user-facing APIs, and
advise users to
switch to the newer, interoperable <code>LZ4_RAW</code> codec.</p><h3
id=zstd>ZSTD</h3><p>A codec based on the Zstandard format defined by
<a href=https://tools.ietf.org/html/rfc8478>RFC 8478</a>. If any ambiguity
arises
when implementing this format, the implementation provided by the
-<a href=https://facebook.github.io/zstd/>ZStandard compression library</a>
+<a href=https://facebook.github.io/zstd/>Zstandard compression library</a>
is authoritative.</p><h3 id=lz4_raw>LZ4_RAW</h3><p>A codec based on the <a
href=https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md>LZ4 block
format</a>.
If any ambiguity arises when implementing this format, the implementation
-provided by the <a href=http://www.lz4.org/>LZ4 compression library</a> is authoritative.</p></div><div class=td-content style=page-break-before:always><h1 id=pg-9aa971e751fdd370158d525ad337ef7a>7.2 - Encodings</h1><p><a name=PLAIN></a></p><h3 id=plain-plain--0>Plain: (PLAIN = 0)</h3><p>Supported Types: all</p><p>This is the plain encoding that must be supported for types. It is
+provided by the <a href=https://www.lz4.org/>LZ4 compression library</a> is authoritative.</p></div><div class=td-content style=page-break-before:always><h1 id=pg-9aa971e751fdd370158d525ad337ef7a>7.2 - Encodings</h1><p><a name=PLAIN></a></p><h3 id=plain-plain--0>Plain: (PLAIN = 0)</h3><p>Supported Types: all</p><p>This is the plain encoding that must be supported for types. It is
intended to be the simplest encoding. Values are encoded back to
back.</p><p>The plain encoding is used whenever a more efficient encoding can
not be used. It
stores the data in the following format:</p><ul><li>BOOLEAN: <a
href=/docs/file-format/data-pages/encodings/#BITPACKED>Bit Packed</a>, LSB
first</li><li>INT32: 4 bytes little endian</li><li>INT64: 8 bytes little
endian</li><li>INT96: 12 bytes little endian (deprecated)</li><li>FLOAT: 4
bytes IEEE little endian</li><li>DOUBLE: 8 bytes IEEE little
endian</li><li>BYTE_ARRAY: length in 4 bytes little endian followed by the
bytes contained in the array</li><li>FIXED_LEN_BYTE_ARRAY: the bytes [...]
point types are encoded in IEEE.</p><p>For the byte array type, it encodes the
length as a 4 byte little
@@ -391,7 +391,7 @@ bit label: ABC DEF GHI JKL MNO PQR STU VWX
bit label: ABCDEFGH IJKLMNOP QRSTUVWX
</code></pre><p>Note that the BIT_PACKED encoding method is only supported for
encoding
repetition and definition levels.</p><p><a name=DELTAENC></a></p><h3
id=delta-encoding-delta_binary_packed--5>Delta Encoding (DELTA_BINARY_PACKED =
5)</h3><p>Supported Types: INT32, INT64</p><p>This encoding is adapted from the
Binary packing described in
-<a href=http://arxiv.org/pdf/1209.2137v5.pdf>“Decoding billions of integers per second through vectorization”</a>
+<a href=https://arxiv.org/pdf/1209.2137v5.pdf>“Decoding billions of integers per second through vectorization”</a>
by D. Lemire and L. Boytsov.</p><p>In delta encoding we make use of variable
length integers for storing various
numbers (not the deltas themselves). For unsigned values, we use ULEB128,
which is the unsigned version of LEB128 (<a
href=https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)>https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)</a>.
@@ -403,7 +403,7 @@ quotient, the number of values in a miniblock, is a
multiple of 32; it is
stored as a ULEB128 int</li><li>the total value count is stored as a ULEB128
int</li><li>the first value is stored as a zigzag ULEB128 int</li></ul><p>Each
block contains</p><pre tabindex=0><code><min delta> <list of bitwidths
of miniblocks> <miniblocks>
</code></pre><ul><li>the min delta is a zigzag ULEB128 int (we compute a
minimum as we need
positive integers for bit packing)</li><li>the bitwidth of each block is
stored as a byte</li><li>each miniblock is a list of bit packed ints according
to the bit width
-stored at the begining of the block</li></ul><p>To encode a block, we will:</p><ol><li><p>Compute the differences between consecutive elements. For the first
+stored at the beginning of the block</li></ul><p>To encode a block, we will:</p><ol><li><p>Compute the differences between consecutive elements. For the first
element in the block, use the last element in the previous block or, in
the case of the first block, use the first value of the whole sequence,
stored in the header.</p></li><li><p>Compute the frame of reference (the
minimum of the deltas in the block).
@@ -552,7 +552,7 @@ data set (table). This string is optionally passed by a
writer upon file creatio
the AAD prefix is stored in an <code>aad_prefix</code> field in the file, and
is made available to the readers.
This field is not encrypted. If a user is concerned about keeping the file
identity inside the file,
the writer code can explicitly request Parquet not to store the AAD prefix.
Then the aad_prefix field
-will be empty; AAD prefixes must be fully managed by the caller code and supplied explictly to Parquet
+will be empty; AAD prefixes must be fully managed by the caller code and supplied explicitly to Parquet
readers for each file.</p><p>The protection against swapping full files is
optional. It is not enabled by default because
it requires the writers to generate and pass an AAD prefix.</p><p>A reader of
a file created with an AAD prefix, should be able to verify the prefix (file
identity)
by comparing it with e.g. the target table name, using a convention accepted
in the organization.
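
Reviewer note on the delta-encoding hunks in the file diff above: the quoted text says header fields are stored as ULEB128 ints and that the first value and the min delta are stored as zigzag ULEB128 ints. The following is a minimal Python sketch of those two building blocks, assuming 64-bit signed inputs. The function names are hypothetical and are not taken from any Parquet implementation.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # keep results within 64 bits


def zigzag64(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 63)) & MASK64


def uleb128(value: int) -> bytes:
    """Encode a non-negative integer as ULEB128: 7 payload bits per byte, low bits first."""
    if value < 0:
        raise ValueError("ULEB128 encodes unsigned values only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)         # last byte: continuation bit clear
            return bytes(out)


def zigzag_uleb128(n: int) -> bytes:
    """Zigzag-map a signed value, then ULEB128-encode the result."""
    return uleb128(zigzag64(n))
```

Zigzag mapping keeps small negative numbers small after the mapping, so a min delta such as -1 still fits in a single ULEB128 byte.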
diff --git a/output/docs/file-format/bloomfilter/index.html
b/output/docs/file-format/bloomfilter/index.html
index 943bd18..0176008 100644
--- a/output/docs/file-format/bloomfilter/index.html
+++ b/output/docs/file-format/bloomfilter/index.html
@@ -1,5 +1,5 @@
<!doctype html><html itemscope itemtype=http://schema.org/WebPage lang=en
class=no-js><head><meta charset=utf-8><meta name=viewport
content="width=device-width,initial-scale=1,shrink-to-fit=no"><meta name=robots
content="index, follow"><link rel="shortcut icon"
href=/favicons/favicon.ico><link rel=apple-touch-icon
href=/favicons/apple-touch-icon-180x180.png sizes=180x180><link rel=icon
type=image/png href=/favicons/favicon-16x16.png sizes=16x16><link rel=icon
type=image/png href=/favicon [...]
-<meta name=description content="Problem statement In their current format, column statistics and dictionaries can be used for predicate pushdown. Statistics include minimum and maximum value, which can be used to filter out values not in the range. Dictionaries are more specific, and readers can filter out values that are between min and max but not in the dictionary. However, when there are too many distinct values, writers sometimes choose not to add dictionaries because of the extra s [...]
+<meta name=description content="Problem statement In their current format, column statistics and dictionaries can be used for predicate pushdown. Statistics include minimum and maximum value, which can be used to filter out values not in the range. Dictionaries are more specific, and readers can filter out values that are between min and max but not in the dictionary. However, when there are too many distinct values, writers sometimes choose not to add dictionaries because of the extra s [...]
<a
href=https://github.com/apache/parquet-site/edit/production/content/en/docs/File%20Format/bloomfilter.md
class="td-page-meta--edit td-page-meta__edit" target=_blank rel=noopener><i
class="fa-solid fa-pen-to-square fa-fw"></i> Edit this page</a>
<a
href="https://github.com/apache/parquet-site/new/production/content/en/docs/File%20Format?filename=change-me.md&value=---%0Atitle%3A+%22Long+Page+Title%22%0AlinkTitle%3A+%22Short+Nav+Title%22%0Aweight%3A+100%0Adescription%3A+%3E-%0A+++++Page+description+for+heading+and+indexes.%0A---%0A%0A%23%23+Heading%0A%0AEdit+this+template+to+create+your+new+page.%0A%0A%2A+Give+it+a+good+name%2C+ending+in+%60.md%60+-+e.g.+%60getting-started.md%60%0A%2A+Edit+the+%22front+matter%22+section+at+th
[...]
<a
href="https://github.com/apache/parquet-site/issues/new?title=Bloom%20Filter"
class="td-page-meta--issue td-page-meta__issue" target=_blank rel=noopener><i
class="fa-solid fa-list-check fa-fw"></i> Create documentation issue</a>
@@ -104,7 +104,7 @@ chosen as follows:</p><div class=highlight><pre tabindex=0
style=background-colo
</span></span><span style=display:flex><span><span
style=color:#204a87;font-weight:700>unsigned</span> <span
style=color:#000>int64</span> <span style=color:#000>z_as_64_bit</span> <span
style=color:#ce5c00;font-weight:700>=</span> <span
style=color:#000>z</span><span style=color:#000;font-weight:700>;</span>
</span></span><span style=display:flex><span><span
style=color:#204a87;font-weight:700>unsigned</span> <span
style=color:#000>int32</span> <span style=color:#000>i</span> <span
style=color:#ce5c00;font-weight:700>=</span> <span
style=color:#000;font-weight:700>(</span><span
style=color:#000>h_top_bits</span> <span
style=color:#ce5c00;font-weight:700>*</span> <span
style=color:#000>z_as_64_bit</span><span
style=color:#000;font-weight:700>)</span> <span
style=color:#ce5c00;font-weight:700> [...]
</span></span></code></pre></div><p>The first line extracts the most
significant 32 bits from <code>h</code> and
-assignes them to a 64-bit unsigned integer. The second line is
+assigns them to a 64-bit unsigned integer. The second line is
simpler: it just sets an unsigned 64-bit value to the same value as
the 32-bit unsigned value <code>z</code>. The purpose of having both
<code>h_top_bits</code>
and <code>z_as_64_bit</code> be 64-bit values is so that their product is a
@@ -135,7 +135,7 @@ significant 32 bits.</p><pre tabindex=0><code>void
filter_insert(SBBF filter, un
block b = filter.getBlock(i);
return block_check(b, (unsigned int32)x)
}
-</code></pre><p>The use of blocks is from Putze et al.’s <a href=http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf>Cache-, Hash- and
+</code></pre><p>The use of blocks is from Putze et al.’s <a href=https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf>Cache-, Hash- and
Space-Efficient Bloom
filters</a></p><p>To use an SBBF for values of arbitrary Parquet types, we
apply a hash
function to that value - at the time of writing,
@@ -143,12 +143,12 @@ function to that value - at the time of writing,
with a seed of 0 and <a
href=https://github.com/Cyan4973/xxHash/blob/v0.7.0/doc/xxhash_spec.md>following
the specification version
0.1.1</a>.</p><h4 id=sizing-an-sbbf>Sizing an SBBF</h4><p>The
<code>check</code> operation in SBBFs can return <code>true</code> for an
argument that
was never inserted into the SBBF. These are called “false
-positives”. The “false positive probabilty” is the probability that
+positives”. The “false positive probability” is the probability that
any given hash value that was never <code>insert</code>ed into the SBBF will
cause <code>check</code> to return <code>true</code> (a false positive). There
is not a
simple closed-form calculation of this probability, but here is an
example:</p><p>A filter that uses 1024 blocks and has had 26,214 hash values
-<code>insert</code>ed will have a false positive probabilty of around 1.26%. Each
+<code>insert</code>ed will have a false positive probability of around 1.26%. Each
of those 1024 blocks occupies 256 bits of space, so the total space
usage is 262,144. That means that the ratio of bits of space to hash
values is 10-to-1. Adding more hash values increases the denominator
@@ -224,6 +224,6 @@ serialized Bitset.</p><p>For Bloom filters in sensitive
columns, each of the two
serialization, and then written to the file. The encryption will be performed
using the AES GCM
cipher, with the same column key, but with different AAD module types -
“BloomFilter Header” (8)
and “BloomFilter Bitset” (9). The length of the encrypted buffer
is written before the buffer, as
-described in the Parquet encryption specification.</p><div class=td-page-meta__lastmod>Last modified January 14, 2024: <a href=https://github.com/apache/parquet-site/commit/7cf58a9ec47d96608dfec9771179691301ede3ce>Sync site with format release v2.10.0 (7cf58a9)</a></div></div></main></div></div><footer class="td-footer row d-print-none"><div class=container-fluid><div class="row mx-md-2"><div class="td-footer__left col-6 col-sm-4 order-sm-1"><ul class=td-footer__links-list><li class=td-f [...]
+described in the Parquet encryption specification.</p><div class=td-page-meta__lastmod>Last modified March 11, 2024: <a href=https://github.com/apache/parquet-site/commit/e79b30489c6bd50f0829a5f2b87f4a26f5e4af05>Fix typos (#46) (e79b304)</a></div></div></main></div></div><footer class="td-footer row d-print-none"><div class=container-fluid><div class="row mx-md-2"><div class="td-footer__left col-6 col-sm-4 order-sm-1"><ul class=td-footer__links-list><li class=td-footer__links-item data-b [...]
2024
<span class=td-footer__authors>Apache Parquet</span></span><span
class=td-footer__all_rights_reserved>All Rights Reserved</span><span
class=ms-2><a href=https://policies.google.com/privacy target=_blank
rel=noopener>Privacy Policy</a></span></div></div></div></footer></div><script
src=/js/main.min.1f48fc7981e4db829114650dc98d270b6642a46c1e4ebddb8389ff0a463a6328.js
integrity="sha256-H0j8eYHk24KRFGUNyY0nC2ZCpGweTr3bg4n/CkY6Yyg="
crossorigin=anonymous></script><script defer src=/js/click-to [...]
\ No newline at end of file
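
Reviewer note on the Bloom filter hunks above: the quoted text describes picking a block from the top 32 bits of a 64-bit hash, and sizing a filter of 1024 blocks of 256 bits each for 26,214 inserted values. Below is a small Python sketch of both calculations. The function names are hypothetical. The final shift by 32 is an assumption: that line is truncated in this email, and the shift is filled in from the published Parquet Bloom filter specification.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # emulate unsigned 64-bit arithmetic
BLOCK_BITS = 256             # each SBBF block occupies 256 bits, per the text


def block_index(h: int, num_blocks: int) -> int:
    """Map a 64-bit hash h to a block index in [0, num_blocks)."""
    h_top_bits = (h & MASK64) >> 32  # most significant 32 bits of h
    z_as_64_bit = num_blocks         # z, the number of blocks, widened to 64 bits
    # Assumed from the spec: keep the upper 32 bits of the 64-bit product.
    return ((h_top_bits * z_as_64_bit) & MASK64) >> 32


def bits_per_value(num_blocks: int, num_values: int) -> float:
    """Ratio of filter space in bits to the number of inserted hash values."""
    return num_blocks * BLOCK_BITS / num_values
```

With the figures from the text, `1024 * 256` gives the stated 262,144 bits, and `bits_per_value(1024, 26214)` comes out at roughly 10, matching the 10-to-1 ratio.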
diff --git a/output/docs/file-format/data-pages/_print/index.html
b/output/docs/file-format/data-pages/_print/index.html
index 23e8a71..dc134fe 100644
--- a/output/docs/file-format/data-pages/_print/index.html
+++ b/output/docs/file-format/data-pages/_print/index.html
@@ -25,7 +25,7 @@ If any ambiguity arises when implementing this format, the
implementation
provided by the <a href=https://zlib.net/>zlib compression library</a> is
authoritative.</p><p>Readers should support reading pages containing multiple
GZIP members, however,
as this has historically not been supported by all implementations, it is
recommended
that writers refrain from creating such pages by default for better
interoperability.</p><h3 id=lzo>LZO</h3><p>A codec based on or interoperable
with the
-<a href=http://www.oberhumer.com/opensource/lzo/>LZO compression library</a>.</p><h3 id=brotli>BROTLI</h3><p>A codec based on the Brotli format defined by
+<a href=https://www.oberhumer.com/opensource/lzo/>LZO compression library</a>.</p><h3 id=brotli>BROTLI</h3><p>A codec based on the Brotli format defined by
<a href=https://tools.ietf.org/html/rfc7932>RFC 7932</a>.
If any ambiguity arises when implementing this format, the implementation
provided by the <a href=https://github.com/google/brotli>Brotli compression
library</a>
@@ -37,10 +37,10 @@ this compression codec in their user-facing APIs, and
advise users to
switch to the newer, interoperable <code>LZ4_RAW</code> codec.</p><h3
id=zstd>ZSTD</h3><p>A codec based on the Zstandard format defined by
<a href=https://tools.ietf.org/html/rfc8478>RFC 8478</a>. If any ambiguity
arises
when implementing this format, the implementation provided by the
-<a href=https://facebook.github.io/zstd/>ZStandard compression library</a>
+<a href=https://facebook.github.io/zstd/>Zstandard compression library</a>
is authoritative.</p><h3 id=lz4_raw>LZ4_RAW</h3><p>A codec based on the <a
href=https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md>LZ4 block
format</a>.
If any ambiguity arises when implementing this format, the implementation
-provided by the <a href=http://www.lz4.org/>LZ4 compression library</a> is
authoritative.</p></div><div class=td-content
style=page-break-before:always><h1 id=pg-9aa971e751fdd370158d525ad337ef7a>2 -
Encodings</h1><p><a name=PLAIN></a></p><h3 id=plain-plain--0>Plain: (PLAIN =
0)</h3><p>Supported Types: all</p><p>This is the plain encoding that must be
supported for types. It is
+provided by the <a href=https://www.lz4.org/>LZ4 compression library</a> is
authoritative.</p></div><div class=td-content
style=page-break-before:always><h1 id=pg-9aa971e751fdd370158d525ad337ef7a>2 -
Encodings</h1><p><a name=PLAIN></a></p><h3 id=plain-plain--0>Plain: (PLAIN =
0)</h3><p>Supported Types: all</p><p>This is the plain encoding that must be
supported for types. It is
intended to be the simplest encoding. Values are encoded back to
back.</p><p>The plain encoding is used whenever a more efficient encoding can
not be used. It
stores the data in the following format:</p><ul><li>BOOLEAN: <a
href=/docs/file-format/data-pages/encodings/#BITPACKED>Bit Packed</a>, LSB
first</li><li>INT32: 4 bytes little endian</li><li>INT64: 8 bytes little
endian</li><li>INT96: 12 bytes little endian (deprecated)</li><li>FLOAT: 4
bytes IEEE little endian</li><li>DOUBLE: 8 bytes IEEE little
endian</li><li>BYTE_ARRAY: length in 4 bytes little endian followed by the
bytes contained in the array</li><li>FIXED_LEN_BYTE_ARRAY: the bytes [...]
point types are encoded in IEEE.</p><p>For the byte array type, it encodes the
length as a 4 byte little
@@ -108,7 +108,7 @@ bit label: ABC DEF GHI JKL MNO PQR STU VWX
bit label: ABCDEFGH IJKLMNOP QRSTUVWX
</code></pre><p>Note that the BIT_PACKED encoding method is only supported for
encoding
repetition and definition levels.</p><p><a name=DELTAENC></a></p><h3
id=delta-encoding-delta_binary_packed--5>Delta Encoding (DELTA_BINARY_PACKED =
5)</h3><p>Supported Types: INT32, INT64</p><p>This encoding is adapted from the
Binary packing described in
-<a href=http://arxiv.org/pdf/1209.2137v5.pdf>“Decoding billions of
integers per second through vectorization”</a>
+<a href=https://arxiv.org/pdf/1209.2137v5.pdf>“Decoding billions of
integers per second through vectorization”</a>
by D. Lemire and L. Boytsov.</p><p>In delta encoding we make use of variable
length integers for storing various
numbers (not the deltas themselves). For unsigned values, we use ULEB128,
which is the unsigned version of LEB128 (<a
href=https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)>https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)</a>.
@@ -120,7 +120,7 @@ quotient, the number of values in a miniblock, is a multiple of 32; it is
stored as a ULEB128 int</li><li>the total value count is stored as a ULEB128
int</li><li>the first value is stored as a zigzag ULEB128 int</li></ul><p>Each
block contains</p><pre tabindex=0><code><min delta> <list of bitwidths
of miniblocks> <miniblocks>
</code></pre><ul><li>the min delta is a zigzag ULEB128 int (we compute a
minimum as we need
positive integers for bit packing)</li><li>the bitwidth of each block is
stored as a byte</li><li>each miniblock is a list of bit packed ints according
to the bit width
-stored at the begining of the block</li></ul><p>To encode a block, we
will:</p><ol><li><p>Compute the differences between consecutive elements. For
the first
+stored at the beginning of the block</li></ul><p>To encode a block, we
will:</p><ol><li><p>Compute the differences between consecutive elements. For
the first
element in the block, use the last element in the previous block or, in
the case of the first block, use the first value of the whole sequence,
stored in the header.</p></li><li><p>Compute the frame of reference (the
minimum of the deltas in the block).
@@ -269,7 +269,7 @@ data set (table). This string is optionally passed by a writer upon file creatio
the AAD prefix is stored in an <code>aad_prefix</code> field in the file, and
is made available to the readers.
This field is not encrypted. If a user is concerned about keeping the file
identity inside the file,
the writer code can explicitly request Parquet not to store the AAD prefix.
Then the aad_prefix field
-will be empty; AAD prefixes must be fully managed by the caller code and
supplied explictly to Parquet
+will be empty; AAD prefixes must be fully managed by the caller code and
supplied explicitly to Parquet
readers for each file.</p><p>The protection against swapping full files is
optional. It is not enabled by default because
it requires the writers to generate and pass an AAD prefix.</p><p>A reader of
a file created with an AAD prefix, should be able to verify the prefix (file
identity)
by comparing it with e.g. the target table name, using a convention accepted
in the organization.
diff --git a/output/docs/file-format/data-pages/compression/index.html b/output/docs/file-format/data-pages/compression/index.html
index 1c301cb..be03cd1 100644
--- a/output/docs/file-format/data-pages/compression/index.html
+++ b/output/docs/file-format/data-pages/compression/index.html
@@ -3,9 +3,9 @@
The detailed specifications of compression codecs are maintained externally by
their respective authors or maintainers, which we reference hereafter.
For all compression codecs except the deprecated LZ4 codec, the raw data of a
(data or dictionary) page is fed as-is to the underlying compression library,
without any additional framing or padding."><meta property="og:title"
content="Compression"><meta property="og:description" content="Overview Parquet
allows the data block inside dictionary pages and data pages to be compressed
for better space efficiency. The Parquet format supports several compression
covering different areas in the [...]
The detailed specifications of compression codecs are maintained externally by
their respective authors or maintainers, which we reference hereafter.
-For all compression codecs except the deprecated LZ4 codec, the raw data of a
(data or dictionary) page is fed as-is to the underlying compression library,
without any additional framing or padding."><meta property="og:type"
content="article"><meta property="og:url"
content="/docs/file-format/data-pages/compression/"><meta
property="article:section" content="docs"><meta
property="article:modified_time" content="2024-03-08T16:33:45-05:00"><meta
property="og:site_name" content="Apache Parq [...]
+For all compression codecs except the deprecated LZ4 codec, the raw data of a
(data or dictionary) page is fed as-is to the underlying compression library,
without any additional framing or padding."><meta property="og:type"
content="article"><meta property="og:url"
content="/docs/file-format/data-pages/compression/"><meta
property="article:section" content="docs"><meta
property="article:modified_time" content="2024-03-11T22:11:10+01:00"><meta
property="og:site_name" content="Apache Parq [...]
The detailed specifications of compression codecs are maintained externally by
their respective authors or maintainers, which we reference hereafter.
-For all compression codecs except the deprecated LZ4 codec, the raw data of a
(data or dictionary) page is fed as-is to the underlying compression library,
without any additional framing or padding."><meta itemprop=dateModified
content="2024-03-08T16:33:45-05:00"><meta itemprop=wordCount
content="379"><meta itemprop=keywords content><meta name=twitter:card
content="summary"><meta name=twitter:title content="Compression"><meta
name=twitter:description content="Overview Parquet allows the [...]
+For all compression codecs except the deprecated LZ4 codec, the raw data of a
(data or dictionary) page is fed as-is to the underlying compression library,
without any additional framing or padding."><meta itemprop=dateModified
content="2024-03-11T22:11:10+01:00"><meta itemprop=wordCount
content="379"><meta itemprop=keywords content><meta name=twitter:card
content="summary"><meta name=twitter:title content="Compression"><meta
name=twitter:description content="Overview Parquet allows the [...]
The detailed specifications of compression codecs are maintained externally by
their respective authors or maintainers, which we reference hereafter.
For all compression codecs except the deprecated LZ4 codec, the raw data of a
(data or dictionary) page is fed as-is to the underlying compression library,
without any additional framing or padding."><link rel=preload
href=/scss/main.min.c57c6f762acece9f1b5c78f58a507b2e5ff04d2c4c951e6678debf7d71d25341.css
as=style><link
href=/scss/main.min.c57c6f762acece9f1b5c78f58a507b2e5ff04d2c4c951e6678debf7d71d25341.css
rel=stylesheet integrity><script
src=https://code.jquery.com/jquery-3.6.3.min.js [...]
<a
href=https://github.com/apache/parquet-site/edit/production/content/en/docs/File%20Format/Data%20Pages/compression.md
class="td-page-meta--edit td-page-meta__edit" target=_blank rel=noopener><i
class="fa-solid fa-pen-to-square fa-fw"></i> Edit this page</a>
@@ -29,7 +29,7 @@ If any ambiguity arises when implementing this format, the implementation
provided by the <a href=https://zlib.net/>zlib compression library</a> is
authoritative.</p><p>Readers should support reading pages containing multiple
GZIP members, however,
as this has historically not been supported by all implementations, it is
recommended
that writers refrain from creating such pages by default for better
interoperability.</p><h3 id=lzo>LZO</h3><p>A codec based on or interoperable
with the
-<a href=http://www.oberhumer.com/opensource/lzo/>LZO compression
library</a>.</p><h3 id=brotli>BROTLI</h3><p>A codec based on the Brotli format
defined by
+<a href=https://www.oberhumer.com/opensource/lzo/>LZO compression
library</a>.</p><h3 id=brotli>BROTLI</h3><p>A codec based on the Brotli format
defined by
<a href=https://tools.ietf.org/html/rfc7932>RFC 7932</a>.
If any ambiguity arises when implementing this format, the implementation
provided by the <a href=https://github.com/google/brotli>Brotli compression
library</a>
@@ -41,9 +41,9 @@ this compression codec in their user-facing APIs, and advise users to
switch to the newer, interoperable <code>LZ4_RAW</code> codec.</p><h3
id=zstd>ZSTD</h3><p>A codec based on the Zstandard format defined by
<a href=https://tools.ietf.org/html/rfc8478>RFC 8478</a>. If any ambiguity
arises
when implementing this format, the implementation provided by the
-<a href=https://facebook.github.io/zstd/>ZStandard compression library</a>
+<a href=https://facebook.github.io/zstd/>Zstandard compression library</a>
is authoritative.</p><h3 id=lz4_raw>LZ4_RAW</h3><p>A codec based on the <a
href=https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md>LZ4 block
format</a>.
If any ambiguity arises when implementing this format, the implementation
-provided by the <a href=http://www.lz4.org/>LZ4 compression library</a> is
authoritative.</p><div class=td-page-meta__lastmod>Last modified March 8, 2024:
<a
href=https://github.com/apache/parquet-site/commit/b3b81ce3e9f9e6f25b41f463577976628515384a>Update
to new website (b3b81ce)</a></div></div></main></div></div><footer
class="td-footer row d-print-none"><div class=container-fluid><div class="row
mx-md-2"><div class="td-footer__left col-6 col-sm-4 order-sm-1"><ul
class=td-footer__links [...]
+provided by the <a href=https://www.lz4.org/>LZ4 compression library</a> is
authoritative.</p><div class=td-page-meta__lastmod>Last modified March 11,
2024: <a
href=https://github.com/apache/parquet-site/commit/e79b30489c6bd50f0829a5f2b87f4a26f5e4af05>Fix
typos (#46) (e79b304)</a></div></div></main></div></div><footer
class="td-footer row d-print-none"><div class=container-fluid><div class="row
mx-md-2"><div class="td-footer__left col-6 col-sm-4 order-sm-1"><ul
class=td-footer__links-lis [...]
2024
<span class=td-footer__authors>Apache Parquet</span></span><span
class=td-footer__all_rights_reserved>All Rights Reserved</span><span
class=ms-2><a href=https://policies.google.com/privacy target=_blank
rel=noopener>Privacy Policy</a></span></div></div></div></footer></div><script
src=/js/main.min.1f48fc7981e4db829114650dc98d270b6642a46c1e4ebddb8389ff0a463a6328.js
integrity="sha256-H0j8eYHk24KRFGUNyY0nC2ZCpGweTr3bg4n/CkY6Yyg="
crossorigin=anonymous></script><script defer src=/js/click-to [...]
\ No newline at end of file
diff --git a/output/docs/file-format/data-pages/encodings/index.html b/output/docs/file-format/data-pages/encodings/index.html
index c725fc6..9cc6584 100644
--- a/output/docs/file-format/data-pages/encodings/index.html
+++ b/output/docs/file-format/data-pages/encodings/index.html
@@ -5,10 +5,10 @@ The plain encoding is used whenever a more efficient encoding can not be used. I
BOOLEAN: Bit Packed, LSB first INT32: 4 bytes little endian INT64: 8 bytes
little endian INT96: 12 bytes little endian (deprecated) FLOAT: 4 bytes IEEE
little endian DOUBLE: 8 bytes IEEE little endian BYTE_ARRAY: length in 4 bytes
little endian followed by the bytes contained in the array
FIXED_LEN_BYTE_ARRAY: the bytes contained in the array For native types, this
outputs the data as little endian."><meta property="og:title"
content="Encodings"><meta property="og:description" content="P [...]
This is the plain encoding that must be supported for types. It is intended to
be the simplest encoding. Values are encoded back to back.
The plain encoding is used whenever a more efficient encoding can not be used.
It stores the data in the following format:
-BOOLEAN: Bit Packed, LSB first INT32: 4 bytes little endian INT64: 8 bytes
little endian INT96: 12 bytes little endian (deprecated) FLOAT: 4 bytes IEEE
little endian DOUBLE: 8 bytes IEEE little endian BYTE_ARRAY: length in 4 bytes
little endian followed by the bytes contained in the array
FIXED_LEN_BYTE_ARRAY: the bytes contained in the array For native types, this
outputs the data as little endian."><meta property="og:type"
content="article"><meta property="og:url" content="/docs/file-f [...]
+BOOLEAN: Bit Packed, LSB first INT32: 4 bytes little endian INT64: 8 bytes
little endian INT96: 12 bytes little endian (deprecated) FLOAT: 4 bytes IEEE
little endian DOUBLE: 8 bytes IEEE little endian BYTE_ARRAY: length in 4 bytes
little endian followed by the bytes contained in the array
FIXED_LEN_BYTE_ARRAY: the bytes contained in the array For native types, this
outputs the data as little endian."><meta property="og:type"
content="article"><meta property="og:url" content="/docs/file-f [...]
This is the plain encoding that must be supported for types. It is intended to
be the simplest encoding. Values are encoded back to back.
The plain encoding is used whenever a more efficient encoding can not be used.
It stores the data in the following format:
-BOOLEAN: Bit Packed, LSB first INT32: 4 bytes little endian INT64: 8 bytes
little endian INT96: 12 bytes little endian (deprecated) FLOAT: 4 bytes IEEE
little endian DOUBLE: 8 bytes IEEE little endian BYTE_ARRAY: length in 4 bytes
little endian followed by the bytes contained in the array
FIXED_LEN_BYTE_ARRAY: the bytes contained in the array For native types, this
outputs the data as little endian."><meta itemprop=dateModified
content="2024-01-14T20:32:15+08:00"><meta itemprop=wordCount [...]
+BOOLEAN: Bit Packed, LSB first INT32: 4 bytes little endian INT64: 8 bytes
little endian INT96: 12 bytes little endian (deprecated) FLOAT: 4 bytes IEEE
little endian DOUBLE: 8 bytes IEEE little endian BYTE_ARRAY: length in 4 bytes
little endian followed by the bytes contained in the array
FIXED_LEN_BYTE_ARRAY: the bytes contained in the array For native types, this
outputs the data as little endian."><meta itemprop=dateModified
content="2024-03-11T22:11:10+01:00"><meta itemprop=wordCount [...]
This is the plain encoding that must be supported for types. It is intended to
be the simplest encoding. Values are encoded back to back.
The plain encoding is used whenever a more efficient encoding can not be used.
It stores the data in the following format:
BOOLEAN: Bit Packed, LSB first INT32: 4 bytes little endian INT64: 8 bytes
little endian INT96: 12 bytes little endian (deprecated) FLOAT: 4 bytes IEEE
little endian DOUBLE: 8 bytes IEEE little endian BYTE_ARRAY: length in 4 bytes
little endian followed by the bytes contained in the array
FIXED_LEN_BYTE_ARRAY: the bytes contained in the array For native types, this
outputs the data as little endian."><link rel=preload
href=/scss/main.min.c57c6f762acece9f1b5c78f58a507b2e5ff04d2c4c951e6678 [...]
@@ -83,7 +83,7 @@ bit label: ABC DEF GHI JKL MNO PQR STU VWX
bit label: ABCDEFGH IJKLMNOP QRSTUVWX
</code></pre><p>Note that the BIT_PACKED encoding method is only supported for
encoding
repetition and definition levels.</p><p><a name=DELTAENC></a></p><h3
id=delta-encoding-delta_binary_packed--5>Delta Encoding (DELTA_BINARY_PACKED =
5)</h3><p>Supported Types: INT32, INT64</p><p>This encoding is adapted from the
Binary packing described in
-<a href=http://arxiv.org/pdf/1209.2137v5.pdf>“Decoding billions of
integers per second through vectorization”</a>
+<a href=https://arxiv.org/pdf/1209.2137v5.pdf>“Decoding billions of
integers per second through vectorization”</a>
by D. Lemire and L. Boytsov.</p><p>In delta encoding we make use of variable
length integers for storing various
numbers (not the deltas themselves). For unsigned values, we use ULEB128,
which is the unsigned version of LEB128 (<a
href=https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)>https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128)</a>.
@@ -95,7 +95,7 @@ quotient, the number of values in a miniblock, is a multiple of 32; it is
stored as a ULEB128 int</li><li>the total value count is stored as a ULEB128
int</li><li>the first value is stored as a zigzag ULEB128 int</li></ul><p>Each
block contains</p><pre tabindex=0><code><min delta> <list of bitwidths
of miniblocks> <miniblocks>
</code></pre><ul><li>the min delta is a zigzag ULEB128 int (we compute a
minimum as we need
positive integers for bit packing)</li><li>the bitwidth of each block is
stored as a byte</li><li>each miniblock is a list of bit packed ints according
to the bit width
-stored at the begining of the block</li></ul><p>To encode a block, we
will:</p><ol><li><p>Compute the differences between consecutive elements. For
the first
+stored at the beginning of the block</li></ul><p>To encode a block, we
will:</p><ol><li><p>Compute the differences between consecutive elements. For
the first
element in the block, use the last element in the previous block or, in
the case of the first block, use the first value of the whole sequence,
stored in the header.</p></li><li><p>Compute the frame of reference (the
minimum of the deltas in the block).
@@ -142,6 +142,6 @@ is allowed inside the data page.</p><p>Example:
Original data is three 32-bit floats and for simplicity we look at their raw
representation.</p><pre tabindex=0><code> Element 0 Element 1
Element 2
Bytes AA BB CC DD 00 11 22 33 A3 B4 C5 D6
</code></pre><p>After applying the transformation, the data has the following
representation:</p><pre tabindex=0><code>Bytes AA 00 A3 BB 11 B4 CC 22 C5 DD
33 D6
-</code></pre><div class=td-page-meta__lastmod>Last modified January 14, 2024:
<a
href=https://github.com/apache/parquet-site/commit/7cf58a9ec47d96608dfec9771179691301ede3ce>Sync
site with format release v2.10.0
(7cf58a9)</a></div></div></main></div></div><footer class="td-footer row
d-print-none"><div class=container-fluid><div class="row mx-md-2"><div
class="td-footer__left col-6 col-sm-4 order-sm-1"><ul
class=td-footer__links-list><li class=td-footer__links-item
data-bs-toggle=tooltip [...]
+</code></pre><div class=td-page-meta__lastmod>Last modified March 11, 2024: <a
href=https://github.com/apache/parquet-site/commit/e79b30489c6bd50f0829a5f2b87f4a26f5e4af05>Fix
typos (#46) (e79b304)</a></div></div></main></div></div><footer
class="td-footer row d-print-none"><div class=container-fluid><div class="row
mx-md-2"><div class="td-footer__left col-6 col-sm-4 order-sm-1"><ul
class=td-footer__links-list><li class=td-footer__links-item
data-bs-toggle=tooltip title="User mailing list [...]
2024
<span class=td-footer__authors>Apache Parquet</span></span><span
class=td-footer__all_rights_reserved>All Rights Reserved</span><span
class=ms-2><a href=https://policies.google.com/privacy target=_blank
rel=noopener>Privacy Policy</a></span></div></div></div></footer></div><script
src=/js/main.min.1f48fc7981e4db829114650dc98d270b6642a46c1e4ebddb8389ff0a463a6328.js
integrity="sha256-H0j8eYHk24KRFGUNyY0nC2ZCpGweTr3bg4n/CkY6Yyg="
crossorigin=anonymous></script><script defer src=/js/click-to [...]
\ No newline at end of file
diff --git a/output/docs/file-format/data-pages/encryption/index.html b/output/docs/file-format/data-pages/encryption/index.html
index 6d18ca4..ad996ae 100644
--- a/output/docs/file-format/data-pages/encryption/index.html
+++ b/output/docs/file-format/data-pages/encryption/index.html
@@ -1,8 +1,8 @@
<!doctype html><html itemscope itemtype=http://schema.org/WebPage lang=en
class=no-js><head><meta charset=utf-8><meta name=viewport
content="width=device-width,initial-scale=1,shrink-to-fit=no"><meta name=robots
content="index, follow"><link rel="shortcut icon"
href=/favicons/favicon.ico><link rel=apple-touch-icon
href=/favicons/apple-touch-icon-180x180.png sizes=180x180><link rel=icon
type=image/png href=/favicons/favicon-16x16.png sizes=16x16><link rel=icon
type=image/png href=/favicon [...]
<meta name=description content="Parquet files containing sensitive information
can be protected by the modular encryption mechanism that encrypts and
authenticates the file data and metadata - while allowing for a regular Parquet
functionality (columnar projection, predicate pushdown, encoding and
compression).
1 Problem Statement Existing data protection solutions (such as flat
encryption of files, in-storage encryption, or use of an encrypting storage
client) can be applied to Parquet files, but have various security or
performance issues."><meta property="og:title" content="Parquet Modular
Encryption"><meta property="og:description" content="Parquet files containing
sensitive information can be protected by the modular encryption mechanism that
encrypts and authenticates the file data and me [...]
-1 Problem Statement Existing data protection solutions (such as flat
encryption of files, in-storage encryption, or use of an encrypting storage
client) can be applied to Parquet files, but have various security or
performance issues."><meta property="og:type" content="article"><meta
property="og:url" content="/docs/file-format/data-pages/encryption/"><meta
property="article:section" content="docs"><meta
property="article:modified_time" content="2024-03-08T16:33:45-05:00"><meta
property= [...]
-1 Problem Statement Existing data protection solutions (such as flat
encryption of files, in-storage encryption, or use of an encrypting storage
client) can be applied to Parquet files, but have various security or
performance issues."><meta itemprop=dateModified
content="2024-03-08T16:33:45-05:00"><meta itemprop=wordCount
content="3943"><meta itemprop=keywords content><meta name=twitter:card
content="summary"><meta name=twitter:title content="Parquet Modular
Encryption"><meta name=twitt [...]
+1 Problem Statement Existing data protection solutions (such as flat
encryption of files, in-storage encryption, or use of an encrypting storage
client) can be applied to Parquet files, but have various security or
performance issues."><meta property="og:type" content="article"><meta
property="og:url" content="/docs/file-format/data-pages/encryption/"><meta
property="article:section" content="docs"><meta
property="article:modified_time" content="2024-03-11T22:11:10+01:00"><meta
property= [...]
+1 Problem Statement Existing data protection solutions (such as flat
encryption of files, in-storage encryption, or use of an encrypting storage
client) can be applied to Parquet files, but have various security or
performance issues."><meta itemprop=dateModified
content="2024-03-11T22:11:10+01:00"><meta itemprop=wordCount
content="3943"><meta itemprop=keywords content><meta name=twitter:card
content="summary"><meta name=twitter:title content="Parquet Modular
Encryption"><meta name=twitt [...]
1 Problem Statement Existing data protection solutions (such as flat
encryption of files, in-storage encryption, or use of an encrypting storage
client) can be applied to Parquet files, but have various security or
performance issues."><link rel=preload
href=/scss/main.min.c57c6f762acece9f1b5c78f58a507b2e5ff04d2c4c951e6678debf7d71d25341.css
as=style><link
href=/scss/main.min.c57c6f762acece9f1b5c78f58a507b2e5ff04d2c4c951e6678debf7d71d25341.css
rel=stylesheet integrity><script src=https:// [...]
<a
href=https://github.com/apache/parquet-site/edit/production/content/en/docs/File%20Format/Data%20Pages/encryption.md
class="td-page-meta--edit td-page-meta__edit" target=_blank rel=noopener><i
class="fa-solid fa-pen-to-square fa-fw"></i> Edit this page</a>
<a
href="https://github.com/apache/parquet-site/new/production/content/en/docs/File%20Format/Data%20Pages?filename=change-me.md&value=---%0Atitle%3A+%22Long+Page+Title%22%0AlinkTitle%3A+%22Short+Nav+Title%22%0Aweight%3A+100%0Adescription%3A+%3E-%0A+++++Page+description+for+heading+and+indexes.%0A---%0A%0A%23%23+Heading%0A%0AEdit+this+template+to+create+your+new+page.%0A%0A%2A+Give+it+a+good+name%2C+ending+in+%60.md%60+-+e.g.+%60getting-started.md%60%0A%2A+Edit+the+%22front+matter%22+
[...]
@@ -109,7 +109,7 @@ data set (table). This string is optionally passed by a writer upon file creatio
the AAD prefix is stored in an <code>aad_prefix</code> field in the file, and
is made available to the readers.
This field is not encrypted. If a user is concerned about keeping the file
identity inside the file,
the writer code can explicitly request Parquet not to store the AAD prefix.
Then the aad_prefix field
-will be empty; AAD prefixes must be fully managed by the caller code and
supplied explictly to Parquet
+will be empty; AAD prefixes must be fully managed by the caller code and
supplied explicitly to Parquet
readers for each file.</p><p>The protection against swapping full files is
optional. It is not enabled by default because
it requires the writers to generate and pass an AAD prefix.</p><p>A reader of
a file created with an AAD prefix, should be able to verify the prefix (file
identity)
by comparing it with e.g. the target table name, using a convention accepted
in the organization.
@@ -290,6 +290,6 @@ data - calculated by comparing the page encryption overhead (nonce + tag + lengt
to the default page size (1 MB). This is a rough estimation, and can change
with the encryption
algorithm (no 16-byte tag in AES_GCM_CTR_V1) and with page configuration or
data encoding/compression.</p><p>The throughput overhead of Parquet modular
encryption depends on whether AES enciphering is
done in software or hardware. In both cases, performing encryption on full
pages (~1MB buffers)
-instead of on much smaller individual data values causes AES to work at its
maximal speed.</p><div class=td-page-meta__lastmod>Last modified March 8, 2024:
<a
href=https://github.com/apache/parquet-site/commit/b3b81ce3e9f9e6f25b41f463577976628515384a>Update
to new website (b3b81ce)</a></div></div></main></div></div><footer
class="td-footer row d-print-none"><div class=container-fluid><div class="row
mx-md-2"><div class="td-footer__left col-6 col-sm-4 order-sm-1"><ul
class=td-footer__link [...]
+instead of on much smaller individual data values causes AES to work at its
maximal speed.</p><div class=td-page-meta__lastmod>Last modified March 11,
2024: <a
href=https://github.com/apache/parquet-site/commit/e79b30489c6bd50f0829a5f2b87f4a26f5e4af05>Fix
typos (#46) (e79b304)</a></div></div></main></div></div><footer
class="td-footer row d-print-none"><div class=container-fluid><div class="row
mx-md-2"><div class="td-footer__left col-6 col-sm-4 order-sm-1"><ul
class=td-footer__links-lis [...]
2024
<span class=td-footer__authors>Apache Parquet</span></span><span
class=td-footer__all_rights_reserved>All Rights Reserved</span><span
class=ms-2><a href=https://policies.google.com/privacy target=_blank
rel=noopener>Privacy Policy</a></span></div></div></div></footer></div><script
src=/js/main.min.1f48fc7981e4db829114650dc98d270b6642a46c1e4ebddb8389ff0a463a6328.js
integrity="sha256-H0j8eYHk24KRFGUNyY0nC2ZCpGweTr3bg4n/CkY6Yyg="
crossorigin=anonymous></script><script defer src=/js/click-to [...]
\ No newline at end of file
diff --git a/output/docs/file-format/data-pages/index.xml b/output/docs/file-format/data-pages/index.xml
index fbef5ce..43e4faa 100644
--- a/output/docs/file-format/data-pages/index.xml
+++ b/output/docs/file-format/data-pages/index.xml
@@ -30,7 +30,7 @@ as this has historically not been supported by all implementations, it is recomm
that writers refrain from creating such pages by default for better
interoperability.</p>
<h3 id="lzo">LZO</h3>
<p>A codec based on or interoperable with the
-<a href="http://www.oberhumer.com/opensource/lzo/">LZO compression
library</a>.</p>
+<a href="https://www.oberhumer.com/opensource/lzo/">LZO compression
library</a>.</p>
<h3 id="brotli">BROTLI</h3>
<p>A codec based on the Brotli format defined by
<a href="https://tools.ietf.org/html/rfc7932">RFC 7932</a>.
@@ -49,12 +49,12 @@ switch to the newer, interoperable <code>LZ4_RAW</code> codec.</p>
<p>A codec based on the Zstandard format defined by
<a href="https://tools.ietf.org/html/rfc8478">RFC 8478</a>. If any
ambiguity arises
when implementing this format, the implementation provided by the
-<a href="https://facebook.github.io/zstd/">ZStandard compression
library</a>
+<a href="https://facebook.github.io/zstd/">Zstandard compression
library</a>
is authoritative.</p>
<h3 id="lz4_raw">LZ4_RAW</h3>
<p>A codec based on the <a
href="https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md">LZ4 block
format</a>.
If any ambiguity arises when implementing this format, the implementation
-provided by the <a href="http://www.lz4.org/">LZ4 compression
library</a> is authoritative.</p></description></item><item><title>Docs:
Encodings</title><link>/docs/file-format/data-pages/encodings/</link><pubDate>Mon,
01 Jan 0001 00:00:00
+0000</pubDate><guid>/docs/file-format/data-pages/encodings/</guid><description>
+provided by the <a href="https://www.lz4.org/">LZ4 compression
library</a> is authoritative.</p></description></item><item><title>Docs:
Encodings</title><link>/docs/file-format/data-pages/encodings/</link><pubDate>Mon,
01 Jan 0001 00:00:00
+0000</pubDate><guid>/docs/file-format/data-pages/encodings/</guid><description>
<p><a name="PLAIN"></a></p>
<h3 id="plain-plain--0">Plain: (PLAIN = 0)</h3>
<p>Supported Types: all</p>
@@ -180,7 +180,7 @@ repetition and definition levels.</p>
<h3 id="delta-encoding-delta_binary_packed--5">Delta Encoding
(DELTA_BINARY_PACKED = 5)</h3>
<p>Supported Types: INT32, INT64</p>
<p>This encoding is adapted from the Binary packing described in
-<a href="http://arxiv.org/pdf/1209.2137v5.pdf">&ldquo;Decoding billions
of integers per second through vectorization&rdquo;</a>
+<a href="https://arxiv.org/pdf/1209.2137v5.pdf">&ldquo;Decoding
billions of integers per second through vectorization&rdquo;</a>
by D. Lemire and L. Boytsov.</p>
<p>In delta encoding we make use of variable length integers for storing
various
numbers (not the deltas themselves). For unsigned values, we use ULEB128,
@@ -206,7 +206,7 @@ stored as a ULEB128 int</li>
positive integers for bit packing)</li>
<li>the bitwidth of each block is stored as a byte</li>
<li>each miniblock is a list of bit packed ints according to the bit width
-stored at the begining of the block</li>
+stored at the beginning of the block</li>
</ul>
<p>To encode a block, we will:</p>
<ol>
@@ -483,7 +483,7 @@ data set (table). This string is optionally passed by a
writer upon file creatio
the AAD prefix is stored in an <code>aad_prefix</code> field in the
file, and is made available to the readers.
This field is not encrypted. If a user is concerned about keeping the file
identity inside the file,
the writer code can explicitly request Parquet not to store the AAD prefix.
Then the aad_prefix field
-will be empty; AAD prefixes must be fully managed by the caller code and
supplied explictly to Parquet
+will be empty; AAD prefixes must be fully managed by the caller code and
supplied explicitly to Parquet
readers for each file.</p>
<p>The protection against swapping full files is optional. It is not
enabled by default because
it requires the writers to generate and pass an AAD prefix.</p>
diff --git a/output/docs/file-format/index.xml
b/output/docs/file-format/index.xml
index 8fbd04e..51006c0 100644
--- a/output/docs/file-format/index.xml
+++ b/output/docs/file-format/index.xml
@@ -170,7 +170,7 @@ chosen as follows:</p>
</span></span><span style="display:flex;"><span><span
style="color:#204a87;font-weight:bold">unsigned</span> <span
style="color:#000">int64</span> <span
style="color:#000">z_as_64_bit</span> <span
style="color:#ce5c00;font-weight:bold">=</span> <span
style="color:#000">z</span><span
style="color:#000;font-weight:bold">;</span>
</span></span><span style="display:flex;"><span><span
style="color:#204a87;font-weight:bold">unsigned</span> <span
style="color:#000">int32</span> <span style="color:#000">i</span>
<span style="color:#ce5c00;font-weight:bold">=</span> <span
style="color:#000;font-weight:bold">(</span><span
style="color:#000">h_top_bits</span> <span
style="color:#ce5c00;font-weight:bold">*</span> <span
style="color:#000">z_as_64_bit</span><spa [...]
</span></span></code></pre></div><p>The first line extracts
the most significant 32 bits from <code>h</code> and
-assignes them to a 64-bit unsigned integer. The second line is
+assigns them to a 64-bit unsigned integer. The second line is
simpler: it just sets an unsigned 64-bit value to the same value as
the 32-bit unsigned value <code>z</code>. The purpose of having both
<code>h_top_bits</code>
and <code>z_as_64_bit</code> be 64-bit values is so that their product
is a
@@ -206,7 +206,7 @@ unsigned int64 i = ((x &gt;&gt; 32) *
filter.numberOfBlocks()) &gt;&
block b = filter.getBlock(i);
return block_check(b, (unsigned int32)x)
}
-</code></pre><p>The use of blocks is from Putze et al.&rsquo;s
<a
href="http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf">Cache-,
Hash- and
+</code></pre><p>The use of blocks is from Putze et al.&rsquo;s
<a
href="https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf">Cache-,
Hash- and
Space-Efficient Bloom
filters</a></p>
<p>To use an SBBF for values of arbitrary Parquet types, we apply a hash
@@ -217,13 +217,13 @@ with a seed of 0 and <a
href="https://github.com/Cyan4973/xxHash/blob/v0.7.0/
<h4 id="sizing-an-sbbf">Sizing an SBBF</h4>
<p>The <code>check</code> operation in SBBFs can return
<code>true</code> for an argument that
was never inserted into the SBBF. These are called &ldquo;false
-positives&rdquo;. The &ldquo;false positive probabilty&rdquo; is
the probability that
+positives&rdquo;. The &ldquo;false positive probability&rdquo; is
the probability that
any given hash value that was never <code>insert</code>ed into the SBBF
will
cause <code>check</code> to return <code>true</code> (a false
positive). There is not a
simple closed-form calculation of this probability, but here is an
example:</p>
<p>A filter that uses 1024 blocks and has had 26,214 hash values
-<code>insert</code>ed will have a false positive probabilty of around
1.26%. Each
+<code>insert</code>ed will have a false positive probability of around
1.26%. Each
of those 1024 blocks occupies 256 bits of space, so the total space
usage is 262,144. That means that the ratio of bits of space to hash
values is 10-to-1. Adding more hash values increases the denominator
diff --git a/output/docs/index.xml b/output/docs/index.xml
index c24f462..b796daf 100644
--- a/output/docs/index.xml
+++ b/output/docs/index.xml
@@ -30,7 +30,7 @@ as this has historically not been supported by all
implementations, it is recomm
that writers refrain from creating such pages by default for better
interoperability.</p>
<h3 id="lzo">LZO</h3>
<p>A codec based on or interoperable with the
-<a href="http://www.oberhumer.com/opensource/lzo/">LZO compression
library</a>.</p>
+<a href="https://www.oberhumer.com/opensource/lzo/">LZO compression
library</a>.</p>
<h3 id="brotli">BROTLI</h3>
<p>A codec based on the Brotli format defined by
<a href="https://tools.ietf.org/html/rfc7932">RFC 7932</a>.
@@ -49,12 +49,12 @@ switch to the newer, interoperable
<code>LZ4_RAW</code> codec.</p>
<p>A codec based on the Zstandard format defined by
<a href="https://tools.ietf.org/html/rfc8478">RFC 8478</a>. If any
ambiguity arises
when implementing this format, the implementation provided by the
-<a href="https://facebook.github.io/zstd/">ZStandard compression
library</a>
+<a href="https://facebook.github.io/zstd/">Zstandard compression
library</a>
is authoritative.</p>
<h3 id="lz4_raw">LZ4_RAW</h3>
<p>A codec based on the <a
href="https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md">LZ4 block
format</a>.
If any ambiguity arises when implementing this format, the implementation
-provided by the <a href="http://www.lz4.org/">LZ4 compression
library</a> is authoritative.</p></description></item><item><title>Docs:
Encodings</title><link>/docs/file-format/data-pages/encodings/</link><pubDate>Mon,
01 Jan 0001 00:00:00
+0000</pubDate><guid>/docs/file-format/data-pages/encodings/</guid><description>
+provided by the <a href="https://www.lz4.org/">LZ4 compression
library</a> is authoritative.</p></description></item><item><title>Docs:
Encodings</title><link>/docs/file-format/data-pages/encodings/</link><pubDate>Mon,
01 Jan 0001 00:00:00
+0000</pubDate><guid>/docs/file-format/data-pages/encodings/</guid><description>
<p><a name="PLAIN"></a></p>
<h3 id="plain-plain--0">Plain: (PLAIN = 0)</h3>
<p>Supported Types: all</p>
@@ -180,7 +180,7 @@ repetition and definition levels.</p>
<h3 id="delta-encoding-delta_binary_packed--5">Delta Encoding
(DELTA_BINARY_PACKED = 5)</h3>
<p>Supported Types: INT32, INT64</p>
<p>This encoding is adapted from the Binary packing described in
-<a href="http://arxiv.org/pdf/1209.2137v5.pdf">&ldquo;Decoding billions
of integers per second through vectorization&rdquo;</a>
+<a href="https://arxiv.org/pdf/1209.2137v5.pdf">&ldquo;Decoding
billions of integers per second through vectorization&rdquo;</a>
by D. Lemire and L. Boytsov.</p>
<p>In delta encoding we make use of variable length integers for storing
various
numbers (not the deltas themselves). For unsigned values, we use ULEB128,
@@ -206,7 +206,7 @@ stored as a ULEB128 int</li>
positive integers for bit packing)</li>
<li>the bitwidth of each block is stored as a byte</li>
<li>each miniblock is a list of bit packed ints according to the bit width
-stored at the begining of the block</li>
+stored at the beginning of the block</li>
</ul>
<p>To encode a block, we will:</p>
<ol>
@@ -483,7 +483,7 @@ data set (table). This string is optionally passed by a
writer upon file creatio
the AAD prefix is stored in an <code>aad_prefix</code> field in the
file, and is made available to the readers.
This field is not encrypted. If a user is concerned about keeping the file
identity inside the file,
the writer code can explicitly request Parquet not to store the AAD prefix.
Then the aad_prefix field
-will be empty; AAD prefixes must be fully managed by the caller code and
supplied explictly to Parquet
+will be empty; AAD prefixes must be fully managed by the caller code and
supplied explicitly to Parquet
readers for each file.</p>
<p>The protection against swapping full files is optional. It is not
enabled by default because
it requires the writers to generate and pass an AAD prefix.</p>
@@ -876,7 +876,7 @@ Java resources can be build using <code>mvn
package</code>. The current st
</ol>
<p>If you’d like to report a bug but don’t have time to fix it, you can
still post it to our <a
href="https://issues.apache.org/jira/browse/PARQUET">issue tracker</a>, or
email the mailing list (<a
href="mailto:[email protected]">[email protected]</a>).</p>
<h2 id="committers">Committers</h2>
-<p>Merging a pull request requires being a comitter on the project.</p>
+<p>Merging a pull request requires being a committer on the project.</p>
<p>How to merge a Pull request (have an apache and github-apache remote
setup):</p>
<pre><code>git remote add github-apache
[email protected]:apache/parquet-mr.git
git remote add apache https://gitbox.apache.org/repos/asf?p=parquet-mr.git
@@ -955,7 +955,7 @@ job in the <a
href="https://github.com/apache/parquet-site/blob/production/.g
</code></pre>
<p>If you have problems, read the <a
href="https://www.apache.org/dev/publishing-maven-artifacts.html">publishing
Maven artifacts documentation</a></p>
<h3 id="release-process">Release process</h3>
-<p>Parquet uses the maven-release-plugin to tag a release and push binary
artifacts to staging in Nexus. Once maven completes the release, the offical
source tarball is built from the tag.</p>
+<p>Parquet uses the maven-release-plugin to tag a release and push binary
artifacts to staging in Nexus. Once maven completes the release, the official
source tarball is built from the tag.</p>
<p>Before you start the release process:</p>
<ol>
<li>Verify that the release is finished (no planned JIRAs are pending and
all patches are cherry-picked to the release branch)</li>
@@ -1048,7 +1048,7 @@ svn add apache-parquet-&lt;version&gt;
svn ci -m &quot;Parquet: Add release &lt;VERSION&gt;&quot;
</code></pre>
<h4 id="4-update-parquetapacheorg">4. Update parquet.apache.org</h4>
-<p>Update the downloads page on parquet.apache.org. Instructions for
updating the site are on the <a
href="http://parquet.apache.org/docs/contribution-guidelines/contributing/">contribution
page</a>.</p>
+<p>Update the downloads page on parquet.apache.org. Instructions for
updating the site are on the <a
href="https://parquet.apache.org/docs/contribution-guidelines/contributing/">contribution
page</a>.</p>
<h4
id="5-send-an-announce-e-mail-to-announceapacheorgmailtoannounceapacheorg-and-the-dev-list">5.
Send an ANNOUNCE e-mail to <a
href="mailto:[email protected]">[email protected]</a> and the dev
list</h4>
<pre><code>[ANNOUNCE] Apache Parquet release &lt;VERSION&gt;
I'm please to announce the release of Parquet &lt;VERSION&gt;!
@@ -1226,7 +1226,7 @@ chosen as follows:</p>
</span></span><span style="display:flex;"><span><span
style="color:#204a87;font-weight:bold">unsigned</span> <span
style="color:#000">int64</span> <span
style="color:#000">z_as_64_bit</span> <span
style="color:#ce5c00;font-weight:bold">=</span> <span
style="color:#000">z</span><span
style="color:#000;font-weight:bold">;</span>
</span></span><span style="display:flex;"><span><span
style="color:#204a87;font-weight:bold">unsigned</span> <span
style="color:#000">int32</span> <span style="color:#000">i</span>
<span style="color:#ce5c00;font-weight:bold">=</span> <span
style="color:#000;font-weight:bold">(</span><span
style="color:#000">h_top_bits</span> <span
style="color:#ce5c00;font-weight:bold">*</span> <span
style="color:#000">z_as_64_bit</span><spa [...]
</span></span></code></pre></div><p>The first line extracts
the most significant 32 bits from <code>h</code> and
-assignes them to a 64-bit unsigned integer. The second line is
+assigns them to a 64-bit unsigned integer. The second line is
simpler: it just sets an unsigned 64-bit value to the same value as
the 32-bit unsigned value <code>z</code>. The purpose of having both
<code>h_top_bits</code>
and <code>z_as_64_bit</code> be 64-bit values is so that their product
is a
@@ -1262,7 +1262,7 @@ unsigned int64 i = ((x &gt;&gt; 32) *
filter.numberOfBlocks()) &gt;&
block b = filter.getBlock(i);
return block_check(b, (unsigned int32)x)
}
-</code></pre><p>The use of blocks is from Putze et al.&rsquo;s
<a
href="http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf">Cache-,
Hash- and
+</code></pre><p>The use of blocks is from Putze et al.&rsquo;s
<a
href="https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheefficientbloomfilters-jea.pdf">Cache-,
Hash- and
Space-Efficient Bloom
filters</a></p>
<p>To use an SBBF for values of arbitrary Parquet types, we apply a hash
@@ -1273,13 +1273,13 @@ with a seed of 0 and <a
href="https://github.com/Cyan4973/xxHash/blob/v0.7.0/
<h4 id="sizing-an-sbbf">Sizing an SBBF</h4>
<p>The <code>check</code> operation in SBBFs can return
<code>true</code> for an argument that
was never inserted into the SBBF. These are called &ldquo;false
-positives&rdquo;. The &ldquo;false positive probabilty&rdquo; is
the probability that
+positives&rdquo;. The &ldquo;false positive probability&rdquo; is
the probability that
any given hash value that was never <code>insert</code>ed into the SBBF
will
cause <code>check</code> to return <code>true</code> (a false
positive). There is not a
simple closed-form calculation of this probability, but here is an
example:</p>
<p>A filter that uses 1024 blocks and has had 26,214 hash values
-<code>insert</code>ed will have a false positive probabilty of around
1.26%. Each
+<code>insert</code>ed will have a false positive probability of around
1.26%. Each
of those 1024 blocks occupies 256 bits of space, so the total space
usage is 262,144. That means that the ratio of bits of space to hash
values is 10-to-1. Adding more hash values increases the denominator
diff --git a/output/sitemap.xml b/output/sitemap.xml
index 3201984..54349f4 100644
--- a/output/sitemap.xml
+++ b/output/sitemap.xml
@@ -1 +1 @@
-<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/docs/file-format/data-pages/compression/</loc><lastmod>2024-03-08T16:33:45-05:00</lastmod></url><url><loc>/docs/file-format/data-pages/encodings/</loc><lastmod>2024-01-14T20:32:15+08:00</lastmod></url><url><loc>/docs/file-format/data-pages/encryption/</loc><lastmod>2024-03-08T16:33:45-05:00</lastmod></url><url><loc>/docs/
[...]
\ No newline at end of file
+<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/docs/file-format/data-pages/compression/</loc><lastmod>2024-03-11T22:11:10+01:00</lastmod></url><url><loc>/docs/file-format/data-pages/encodings/</loc><lastmod>2024-03-11T22:11:10+01:00</lastmod></url><url><loc>/docs/file-format/data-pages/encryption/</loc><lastmod>2024-03-11T22:11:10+01:00</lastmod></url><url><loc>/docs/
[...]
\ No newline at end of file