This is an automated email from the ASF dual-hosted git repository.
chaokunyang pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-fury.git
The following commit(s) were added to refs/heads/main by this push:
new 8f79cb0e fix(spec): fix special char overflow in meta string encoding (#1513)
8f79cb0e is described below
commit 8f79cb0e60a123a12476337ddd15105de0422407
Author: Shawn Yang <[email protected]>
AuthorDate: Mon Apr 15 19:33:49 2024 +0800
fix(spec): fix special char overflow in meta string encoding (#1513)
## What does this PR do?
This PR fixes the special char overflow in meta string encoding.
The `a-zA-Z0~9._$` pattern covers 65 chars (26 + 26 + 10 + 3), which can't be
expressed in 6 bits, since 6 bits only cover the range `0~63`.
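The arithmetic can be checked with a minimal sketch. The names `OLD_ALPHABET` and `build_table` are hypothetical and exist only for illustration; they are not part of the Fury codebase. The sketch assumes codes are assigned in the order given in the spec table (`a-z`, `A-Z`, `0~9`, then the two special chars).

```python
import string

# Alphabet the old table tried to fit into 6 bits: a-z, A-Z, 0-9 and "._$".
OLD_ALPHABET = (
    string.ascii_lowercase + string.ascii_uppercase + string.digits + "._$"
)


def build_table(c1: str, c2: str) -> dict[str, int]:
    """Hypothetical helper: map each char to a 6-bit code, keeping only
    two special chars (c1, c2) so that the alphabet has exactly 64 entries."""
    chars = (
        string.ascii_lowercase + string.ascii_uppercase + string.digits + c1 + c2
    )
    return {ch: code for code, ch in enumerate(chars)}


print(len(OLD_ALPHABET))            # 65 chars, one more than 6 bits allow
table = build_table("_", "$")       # e.g. field/type name encoding
print(max(table.values()))          # 63, the largest 6-bit value
```

With only two special chars the largest code is `0b111111`, so every char fits in 6 bits.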
## Does this PR introduce any user-facing change?
- [ ] Does this PR introduce any public API change?
- [ ] Does this PR introduce any binary protocol compatibility change?
---------
Co-authored-by: LiangliangSui <[email protected]>
---
docs/specification/java_serialization_spec.md | 37 +++++++++++++++-----------
docs/specification/xlang_serialization_spec.md | 35 +++++++++++++-----------
2 files changed, 40 insertions(+), 32 deletions(-)
diff --git a/docs/specification/java_serialization_spec.md b/docs/specification/java_serialization_spec.md
index a5d0beca..b05af49d 100644
--- a/docs/specification/java_serialization_spec.md
+++ b/docs/specification/java_serialization_spec.md
@@ -3,6 +3,7 @@ title: Fury Java Serialization Format
sidebar_position: 1
id: fury_java_serialization_spec
---
+
# Fury Java Serialization Specification
## Spec overview
@@ -222,25 +223,29 @@ Meta string is mainly used to encode meta strings such as class name and field n
String binary encoding algorithm:
-| Algorithm                 | Pattern        | Description |
-|---------------------------|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
-| LOWER_SPECIAL             | `a-z._$\|`     | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101` |
-| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._$` | every char is written using 6 bits, `a-z`: `0b00000~0b11110`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `._$`: `0b111110~0b1000000` |
-| UTF-8                     | any chars      | UTF-8 encoding |
+| Algorithm                 | Pattern            | Description |
+|---------------------------|--------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| LOWER_SPECIAL             | `a-z._$\|`         | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101` |
+| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9[c1,c2]` | every char is written using 6 bits, `a-z`: `0b00000~0b11001`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `c1,c2`: `0b111110~0b111111`, `c1,c2` should be two of `._$` |
+| UTF-8                     | any chars          | UTF-8 encoding |
Encoding flags:
-| Encoding Flag             | Pattern | Encoding Algorithm |
-|---------------------------|-----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
-| LOWER_SPECIAL             | every char is in `a-z._$\|` | `LOWER_SPECIAL` |
-| REP_FIRST_LOWER_SPECIAL   | every char is in `a-z._$` except first char is upper case | replace first upper case char to lower case, then use `LOWER_SPECIAL` |
-| REP_MUL_LOWER_SPECIAL     | every char is in `a-zA-Z._$` | replace every upper case char by `\|` + `lower case`, then use `LOWER_SPECIAL`, use this encoding if it's smaller than Encoding `3` |
-| LOWER_UPPER_DIGIT_SPECIAL | every char is in `a-zA-Z._$` | use `LOWER_UPPER_DIGIT_SPECIAL` encoding if it's smaller than Encoding `2` |
-| UTF8                      | any utf-8 char | use `UTF-8` encoding |
-| Compression               | any utf-8 char | lossless compression |
-
-Depending on cases, one can choose encoding `flags + data` jointly, uses 3 bits of first byte for flags and other bytes
-for data.
+| Encoding Flag             | Pattern | Encoding Algorithm |
+|---------------------------|---------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| LOWER_SPECIAL             | every char is in `a-z._$\|` | `LOWER_SPECIAL` |
+| FIRST_TO_LOWER_SPECIAL    | every char is in `a-z[c1,c2]` except first char is upper case | replace first upper case char to lower case, then use `LOWER_SPECIAL` |
+| ALL_TO_LOWER_SPECIAL      | every char is in `a-zA-Z[c1,c2]` | replace every upper case char by `\|` + `lower case`, then use `LOWER_SPECIAL`, use this encoding if it's smaller than Encoding `LOWER_UPPER_DIGIT_SPECIAL` |
+| LOWER_UPPER_DIGIT_SPECIAL | every char is in `a-zA-Z[c1,c2]` | use `LOWER_UPPER_DIGIT_SPECIAL` encoding if it's smaller than Encoding `FIRST_TO_LOWER_SPECIAL` |
+| UTF8                      | any utf-8 char | use `UTF-8` encoding |
+| Compression               | any utf-8 char | lossless compression |
+
+Notes:
+
+- For package name encoding, `c1,c2` should be `._`; For field/type name encoding, `c1,c2` should be `_$`;
+- Depending on cases, one can choose encoding `flags + data` jointly, uses 3 bits of first byte for flags and other
+ bytes
+ for data.
### Shared meta string
diff --git a/docs/specification/xlang_serialization_spec.md b/docs/specification/xlang_serialization_spec.md
index 4641d2ba..dd8c672e 100644
--- a/docs/specification/xlang_serialization_spec.md
+++ b/docs/specification/xlang_serialization_spec.md
@@ -338,25 +338,28 @@ Meta string is mainly used to encode meta strings such as field names.
String binary encoding algorithm:
-| Algorithm                 | Pattern        | Description |
-|---------------------------|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
-| LOWER_SPECIAL             | `a-z._$\|`     | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101` |
-| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._$` | every char is written using 6 bits, `a-z`: `0b00000~0b11110`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `._$`: `0b111110~0b1000000` |
-| UTF-8                     | any chars      | UTF-8 encoding |
+| Algorithm                 | Pattern       | Description |
+|---------------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------|
+| LOWER_SPECIAL             | `a-z._$\|`    | every char is written using 5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101` |
+| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._` | every char is written using 6 bits, `a-z`: `0b00000~0b11001`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `._`: `0b111110~0b111111` |
+| UTF-8                     | any chars     | UTF-8 encoding |
Encoding flags:
-| Encoding Flag             | Pattern | Encoding Algorithm |
-|---------------------------|-----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
-| LOWER_SPECIAL             | every char is in `a-z._$\|` | `LOWER_SPECIAL` |
-| REP_FIRST_LOWER_SPECIAL   | every char is in `a-z._$` except first char is upper case | replace first upper case char to lower case, then use `LOWER_SPECIAL` |
-| REP_MUL_LOWER_SPECIAL     | every char is in `a-zA-Z._$` | replace every upper case char by `\|` + `lower case`, then use `LOWER_SPECIAL`, use this encoding if it's smaller than Encoding `3` |
-| LOWER_UPPER_DIGIT_SPECIAL | every char is in `a-zA-Z._$` | use `LOWER_UPPER_DIGIT_SPECIAL` encoding if it's smaller than Encoding `2` |
-| UTF8                      | any utf-8 char | use `UTF-8` encoding |
-| Compression               | any utf-8 char | lossless compression |
-
-Depending on cases, one can choose encoding `flags + data` jointly, uses 3 bits of first byte for flags and other bytes
-for data.
+| Encoding Flag             | Pattern | Encoding Algorithm |
+|---------------------------|----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| LOWER_SPECIAL             | every char is in `a-z._\|` | `LOWER_SPECIAL` |
+| FIRST_TO_LOWER_SPECIAL    | every char is in `a-z._` except first char is upper case | replace first upper case char to lower case, then use `LOWER_SPECIAL` |
+| ALL_TO_LOWER_SPECIAL      | every char is in `a-zA-Z._` | replace every upper case char by `\|` + `lower case`, then use `LOWER_SPECIAL`, use this encoding if it's smaller than Encoding `LOWER_UPPER_DIGIT_SPECIAL` |
+| LOWER_UPPER_DIGIT_SPECIAL | every char is in `a-zA-Z._` | use `LOWER_UPPER_DIGIT_SPECIAL` encoding if it's smaller than Encoding `FIRST_TO_LOWER_SPECIAL` |
+| UTF8                      | any utf-8 char | use `UTF-8` encoding |
+| Compression               | any utf-8 char | lossless compression |
+
+Notes:
+
+- Depending on cases, one can choose encoding `flags + data` jointly, uses 3 bits of first byte for flags and other
+ bytes
+ for data.
## Value Format