This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/orc.git
The following commit(s) were added to refs/heads/main by this push:
new cccbe7253 MINOR: Fix Patched Base doc in specification
cccbe7253 is described below
commit cccbe7253717c740fd1ed40d094e7496abc2329e
Author: Jefffrey <[email protected]>
AuthorDate: Tue Jul 9 12:17:23 2024 -0700
MINOR: Fix Patched Base doc in specification
### What changes were proposed in this pull request?
Fix patched base specification to state that only 5% of values are patched,
not 10%
### Why are the changes needed?
According to implementation:
https://github.com/apache/orc/blob/0828c2ff114f30c84e4a23fd42ed58c6615c6f97/java/core/src/java/org/apache/orc/impl/RunLengthIntegerWriterV2.java#L535-L550
- Also, 10% of 512 values (about 51) doesn't fit in the max patch list length of 31
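
The arithmetic behind that point can be checked with a minimal sketch. It assumes only the two figures quoted above (512 values per run, a patch list of at most 31 entries). The constant and function names are hypothetical and do not come from the ORC code base, and the sketch ignores any extra gap-skipping entries a real patch list may need.

```python
# Figures quoted in the commit message; names are illustrative only.
MAX_RUN_LENGTH = 512         # values in one run
MAX_PATCH_LIST_LENGTH = 31   # maximum number of patch list entries


def patched_count(percent_patched: int, run_length: int = MAX_RUN_LENGTH) -> int:
    """Approximate number of values wider than W when `percent_patched`
    percent of the run lies above the chosen percentile (rounded down)."""
    return run_length * percent_patched // 100


def fits_patch_list(percent_patched: int) -> bool:
    """True if that many patches fit within the patch list limit."""
    return patched_count(percent_patched) <= MAX_PATCH_LIST_LENGTH


if __name__ == "__main__":
    for percent in (10, 5):
        print(percent, patched_count(percent), fits_patch_list(percent))
```

With a 90th percentile cutoff a full run could need more patches than the list can hold, while a 95th percentile cutoff stays within the limit, which is consistent with the 5% figure in the corrected text.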
Also fix some formatting issues.
### How was this patch tested?
N/A
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #1948 from Jefffrey/patched-base-doc-fix.
Authored-by: Jefffrey <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
---
site/specification/ORCv1.md | 8 ++++----
site/specification/ORCv2.md | 8 ++++----
2 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/site/specification/ORCv1.md b/site/specification/ORCv1.md
index 9aede7a4a..dffbf9034 100644
--- a/site/specification/ORCv1.md
+++ b/site/specification/ORCv1.md
@@ -804,8 +804,8 @@ length of 4 (3) as [0x5e, 0x03, 0x5c, 0xa1, 0xab, 0x1e,
0xde, 0xad,
The patched base encoding is used for integer sequences whose bit
widths varies a lot. The minimum signed value of the sequence is found
and subtracted from the other values. The bit width of those adjusted
-values is analyzed and the 90 percentile of the bit width is chosen
-as W. The 10\% of values larger than W use patches from a patch list
+values is analyzed and the 95 percentile of the bit width is chosen
+as W. The 5% of values larger than W use patches from a patch list
to set the additional bits. Patches are encoded as a list of gaps in
the index values and the additional value bits.
@@ -830,8 +830,8 @@ the index values and the additional value bits.
patch, and a patch value. Patches are applied by logically or'ing
the data values with the relevant patch shifted W bits left. If a
patch is 0, it was introduced to skip over more than 255 items. The
- combined length of each patch (PGW + PW) must be less or equal to
- 64. (PGW + PW) is padded to the closest fixed bit size according to the
+ combined length of each patch (PGW + PW) must be less or equal to 64.
+ (PGW + PW) is padded to the closest fixed bit size according to the
below table before being encoded in the patch list.
(PGW + PW) | closestFixedBits(PGW + PW)
diff --git a/site/specification/ORCv2.md b/site/specification/ORCv2.md
index 2e0c35462..0c773990c 100644
--- a/site/specification/ORCv2.md
+++ b/site/specification/ORCv2.md
@@ -823,8 +823,8 @@ length of 4 (3) as [0x5e, 0x03, 0x5c, 0xa1, 0xab, 0x1e,
0xde, 0xad,
The patched base encoding is used for integer sequences whose bit
widths varies a lot. The minimum signed value of the sequence is found
and subtracted from the other values. The bit width of those adjusted
-values is analyzed and the 90 percentile of the bit width is chosen
-as W. The 10\% of values larger than W use patches from a patch list
+values is analyzed and the 95 percentile of the bit width is chosen
+as W. The 5% of values larger than W use patches from a patch list
to set the additional bits. Patches are encoded as a list of gaps in
the index values and the additional value bits.
@@ -849,8 +849,8 @@ the index values and the additional value bits.
patch, and a patch value. Patches are applied by logically or'ing
the data values with the relevant patch shifted W bits left. If a
patch is 0, it was introduced to skip over more than 255 items. The
- combined length of each patch (PGW + PW) must be less or equal to
- 64. (PGW + PW) is padded to the closest fixed bit size according to the
+ combined length of each patch (PGW + PW) must be less or equal to 64.
+ (PGW + PW) is padded to the closest fixed bit size according to the
below table before being encoded in the patch list.
(PGW + PW) | closestFixedBits(PGW + PW)