This is an automated email from the ASF dual-hosted git repository.

chaokunyang pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-fury.git


The following commit(s) were added to refs/heads/main by this push:
     new 8f79cb0e fix(spec): fix special char overflow in meta string encoding (#1513)
8f79cb0e is described below

commit 8f79cb0e60a123a12476337ddd15105de0422407
Author: Shawn Yang <[email protected]>
AuthorDate: Mon Apr 15 19:33:49 2024 +0800

    fix(spec): fix special char overflow in meta string encoding (#1513)
    
    ## What does this PR do?
    This PR fixes the special char overflow in meta string encoding.
    
    The `a-zA-Z0~9._$` pattern covers 65 chars (26 + 26 + 10 + 3), which
    can't be expressed in 6 bits, whose range is `0~63`.
    
    ## Related issues
    
    
    
    ## Does this PR introduce any user-facing change?
    
    
    - [ ] Does this PR introduce any public API change?
    - [ ] Does this PR introduce any binary protocol compatibility change?
    
    
    ## Benchmark
    
    
    ---------
    
    Co-authored-by: LiangliangSui <[email protected]>
---
 docs/specification/java_serialization_spec.md  | 37 +++++++++++++++-----------
 docs/specification/xlang_serialization_spec.md | 35 +++++++++++++-----------
 2 files changed, 40 insertions(+), 32 deletions(-)
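
The overflow described in the commit message comes down to counting symbols. The following is a minimal sketch, not code from the repository; the variable names are hypothetical, and `._` is used only as one example of a two-char special set.

```python
import string

# Symbols the old LOWER_UPPER_DIGIT_SPECIAL row tried to cover:
# a-z, A-Z, 0-9 and the three special chars `._$`.
old_alphabet = (
    string.ascii_lowercase + string.ascii_uppercase + string.digits + "._$"
)

# After the fix only two special chars (c1, c2) are allowed.
# `._` is an illustrative choice here.
new_alphabet = (
    string.ascii_lowercase + string.ascii_uppercase + string.digits + "._"
)

BITS = 6
capacity = 1 << BITS  # 64 distinct values, 0..63

old_fits = len(old_alphabet) <= capacity
new_fits = len(new_alphabet) <= capacity

print(len(old_alphabet), old_fits)
print(len(new_alphabet), new_fits)
```

With three special chars the alphabet needs one value more than 6 bits can hold; dropping to two special chars makes it fit exactly.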

diff --git a/docs/specification/java_serialization_spec.md b/docs/specification/java_serialization_spec.md
index a5d0beca..b05af49d 100644
--- a/docs/specification/java_serialization_spec.md
+++ b/docs/specification/java_serialization_spec.md
@@ -3,6 +3,7 @@ title: Fury Java Serialization Format
 sidebar_position: 1
 id: fury_java_serialization_spec
 ---
+
 # Fury Java Serialization Specification
 
 ## Spec overview
@@ -222,25 +223,29 @@ Meta string is mainly used to encode meta strings such as class name and field n
 
 String binary encoding algorithm:
 
-| Algorithm                 | Pattern        | Description                     
                                                                                
                                 |
-|---------------------------|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
-| LOWER_SPECIAL             | `a-z._$\|`     | every char is written using 5 
bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101`                      
                                   |
-| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._$` | every char is written using 6 bits, `a-z`: `0b00000~0b11110`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `._$`: `0b111110~0b1000000` |
-| UTF-8                     | any chars      | UTF-8 encoding                  
                                                                                
                                 |
+| Algorithm                 | Pattern            | Description                 
                                                                                
                                                                      |
+|---------------------------|--------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| LOWER_SPECIAL             | `a-z._$\|`         | every char is written using 
5 bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101`                    
                                                                      |
+| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9[c1,c2]` | every char is written using 6 bits, `a-z`: `0b00000~0b11001`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `c1,c2`: `0b111110~0b111111`, `c1,c2` should be two of `._$` |
+| UTF-8                     | any chars          | UTF-8 encoding              
                                                                                
                                                                      |
 
 Encoding flags:
 
-| Encoding Flag             | Pattern                                          
         | Encoding Algorithm                                                   
                                                               |
-|---------------------------|-----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
-| LOWER_SPECIAL             | every char is in `a-z._$\|`                      
         | `LOWER_SPECIAL`                                                      
                                                               |
-| REP_FIRST_LOWER_SPECIAL   | every char is in `a-z._$` except first char is 
upper case | replace first upper case char to lower case, then use 
`LOWER_SPECIAL`                                                               |
-| REP_MUL_LOWER_SPECIAL     | every char is in `a-zA-Z._$`                     
         | replace every upper case char by `\|` + `lower case`, then use 
`LOWER_SPECIAL`, use this encoding if it's smaller than Encoding `3` |
-| LOWER_UPPER_DIGIT_SPECIAL | every char is in `a-zA-Z._$`                     
         | use `LOWER_UPPER_DIGIT_SPECIAL` encoding if it's smaller than 
Encoding `2`                                                          |
-| UTF8                      | any utf-8 char                                   
         | use `UTF-8` encoding                                                 
                                                               |
-| Compression               | any utf-8 char                                   
         | lossless compression                                                 
                                                               |
-
-Depending on cases, one can choose encoding `flags + data` jointly, uses 3 bits of first byte for flags and other bytes
-for data.
+| Encoding Flag             | Pattern                                          
             | Encoding Algorithm                                               
                                                                                
           |
+|---------------------------|---------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| LOWER_SPECIAL             | every char is in `a-z._$\|`                      
             | `LOWER_SPECIAL`                                                  
                                                                                
           |
+| FIRST_TO_LOWER_SPECIAL    | every char is in `a-z[c1,c2]` except first char 
is upper case | replace first upper case char to lower case, then use 
`LOWER_SPECIAL`                                                                 
                      |
+| ALL_TO_LOWER_SPECIAL      | every char is in `a-zA-Z[c1,c2]`                 
             | replace every upper case char by `\|` + `lower case`, then use 
`LOWER_SPECIAL`, use this encoding if it's smaller than Encoding 
`LOWER_UPPER_DIGIT_SPECIAL` |
+| LOWER_UPPER_DIGIT_SPECIAL | every char is in `a-zA-Z[c1,c2]`                 
             | use `LOWER_UPPER_DIGIT_SPECIAL` encoding if it's smaller than 
Encoding `FIRST_TO_LOWER_SPECIAL`                                               
              |
+| UTF8                      | any utf-8 char                                   
             | use `UTF-8` encoding                                             
                                                                                
           |
+| Compression               | any utf-8 char                                   
             | lossless compression                                             
                                                                                
           |
+
+Notes:
+
+- For package name encoding, `c1,c2` should be `._`; For field/type name encoding, `c1,c2` should be `_$`;
+- Depending on cases, one can choose encoding `flags + data` jointly, uses 3 bits of first byte for flags and other
+  bytes
+  for data.
 
 ### Shared meta string
 
diff --git a/docs/specification/xlang_serialization_spec.md b/docs/specification/xlang_serialization_spec.md
index 4641d2ba..dd8c672e 100644
--- a/docs/specification/xlang_serialization_spec.md
+++ b/docs/specification/xlang_serialization_spec.md
@@ -338,25 +338,28 @@ Meta string is mainly used to encode meta strings such as field names.
 
 String binary encoding algorithm:
 
-| Algorithm                 | Pattern        | Description                     
                                                                                
                                 |
-|---------------------------|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
-| LOWER_SPECIAL             | `a-z._$\|`     | every char is written using 5 
bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101`                      
                                   |
-| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._$` | every char is written using 6 bits, `a-z`: `0b00000~0b11110`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `._$`: `0b111110~0b1000000` |
-| UTF-8                     | any chars      | UTF-8 encoding                  
                                                                                
                                 |
+| Algorithm                 | Pattern       | Description                      
                                                                                
                              |
+|---------------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------|
+| LOWER_SPECIAL             | `a-z._$\|`    | every char is written using 5 
bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101`                      
                                 |
+| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._` | every char is written using 6 bits, `a-z`: `0b00000~0b11001`, `A-Z`: `0b11010~0b110011`, `0~9`: `0b110100~0b111101`, `._`: `0b111110~0b111111` |
+| UTF-8                     | any chars     | UTF-8 encoding                   
                                                                                
                              |
 
 Encoding flags:
 
-| Encoding Flag             | Pattern                                          
         | Encoding Algorithm                                                   
                                                               |
-|---------------------------|-----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
-| LOWER_SPECIAL             | every char is in `a-z._$\|`                      
         | `LOWER_SPECIAL`                                                      
                                                               |
-| REP_FIRST_LOWER_SPECIAL   | every char is in `a-z._$` except first char is 
upper case | replace first upper case char to lower case, then use 
`LOWER_SPECIAL`                                                               |
-| REP_MUL_LOWER_SPECIAL     | every char is in `a-zA-Z._$`                     
         | replace every upper case char by `\|` + `lower case`, then use 
`LOWER_SPECIAL`, use this encoding if it's smaller than Encoding `3` |
-| LOWER_UPPER_DIGIT_SPECIAL | every char is in `a-zA-Z._$`                     
         | use `LOWER_UPPER_DIGIT_SPECIAL` encoding if it's smaller than 
Encoding `2`                                                          |
-| UTF8                      | any utf-8 char                                   
         | use `UTF-8` encoding                                                 
                                                               |
-| Compression               | any utf-8 char                                   
         | lossless compression                                                 
                                                               |
-
-Depending on cases, one can choose encoding `flags + data` jointly, uses 3 bits of first byte for flags and other bytes
-for data.
+| Encoding Flag             | Pattern                                          
        | Encoding Algorithm                                                    
                                                                                
      |
+|---------------------------|----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| LOWER_SPECIAL             | every char is in `a-z._\|`                       
        | `LOWER_SPECIAL`                                                       
                                                                                
      |
+| FIRST_TO_LOWER_SPECIAL    | every char is in `a-z._` except first char is 
upper case | replace first upper case char to lower case, then use 
`LOWER_SPECIAL`                                                                 
                      |
+| ALL_TO_LOWER_SPECIAL      | every char is in `a-zA-Z._`                      
        | replace every upper case char by `\|` + `lower case`, then use 
`LOWER_SPECIAL`, use this encoding if it's smaller than Encoding 
`LOWER_UPPER_DIGIT_SPECIAL` |
+| LOWER_UPPER_DIGIT_SPECIAL | every char is in `a-zA-Z._`                      
        | use `LOWER_UPPER_DIGIT_SPECIAL` encoding if it's smaller than 
Encoding `FIRST_TO_LOWER_SPECIAL`                                               
              |
+| UTF8                      | any utf-8 char                                   
        | use `UTF-8` encoding                                                  
                                                                                
      |
+| Compression               | any utf-8 char                                   
        | lossless compression                                                  
                                                                                
      |
+
+Notes:
+
+- Depending on cases, one can choose encoding `flags + data` jointly, uses 3 bits of first byte for flags and other
+  bytes
+  for data.
 
 ## Value Format
 

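The value ranges in the updated `LOWER_UPPER_DIGIT_SPECIAL` row can be read as a 64-entry lookup table. The sketch below only illustrates that char-to-value mapping under the ranges stated in the table. It is not the Fury implementation, the function names are hypothetical, and bit packing into bytes is left out because the diff does not specify it.

```python
import string


def build_table(c1, c2):
    """Return the 64-symbol table; a char's index is its 6-bit value.

    Layout per the spec row: a-z -> 0..25, A-Z -> 26..51,
    0-9 -> 52..61, c1 -> 62, c2 -> 63.
    """
    if c1 == c2 or c1 not in "._$" or c2 not in "._$":
        raise ValueError("c1 and c2 must be two distinct chars from '._$'")
    return (
        string.ascii_lowercase
        + string.ascii_uppercase
        + string.digits
        + c1
        + c2
    )


def encode_values(text, c1, c2):
    """Map each char to its 6-bit value; raises ValueError for other chars."""
    table = build_table(c1, c2)
    return [table.index(ch) for ch in text]


def decode_values(values, c1, c2):
    """Inverse of encode_values."""
    table = build_table(c1, c2)
    return "".join(table[v] for v in values)


# Field/type names use `_$` as c1,c2 according to the Java spec notes.
values = encode_values("aZ0_$", "_", "$")
print(values)
print(decode_values(values, "_", "$"))
```

Every value stays within `0~63`, which is the property the old three-char special set violated.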

