This is an automated email from the ASF dual-hosted git repository.

chaokunyang pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-fury-site.git


The following commit(s) were added to refs/heads/main by this push:
     new 4e3a34c  refine meta string blog (#121)
4e3a34c is described below

commit 4e3a34c34b4a1238e126112e0529676160f100c1
Author: Shawn Yang <[email protected]>
AuthorDate: Tue May 7 10:18:55 2024 +0800

    refine meta string blog (#121)
---
 ...tastring-space-efficient_encoding_for_string.md | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/blog/2024-05-06-metastring-space-efficient_encoding_for_string.md 
b/blog/2024-05-06-metastring-space-efficient_encoding_for_string.md
index 8d75c52..4129247 100644
--- a/blog/2024-05-06-metastring-space-efficient_encoding_for_string.md
+++ b/blog/2024-05-06-metastring-space-efficient_encoding_for_string.md
@@ -7,16 +7,16 @@ tags: [fury]
 
 ## Background
 
-In rpc/serialization systems, we often need to send 
**`namespace/path/filename/fieldName/packageName/moduleName/className/enumValue`**
 between processes.
+In rpc/serialization systems, we often need to send 
**`namespace/path/filename/fieldName/packageName/moduleName/className/enumValue`**
 string between processes.
 
-Those strings are mostly ascii strings. In order to transfer between 
processes, we often encode such strings using utf-8 encodings. Such encoding
+Those strings are mostly ascii strings. In order to transfer between 
processes, we encode such strings using utf-8 encodings. Such encoding
 will take one byte for every char, which is not space efficient actually.
 
-If we take a deeper look, we will found that most chars are **lower chars plus 
`.`, `$` and `_`**, which can be expressed in a much 
-smaller range **`0~32`**, and one byte can represent range `0~255`, the 
significant bits are wasted. And the cost is not ignorable, in a dynamic 
serialization
+If we take a deeper look, we will found that most chars are **lowercase chars, 
 `.`, `$` and `_`**, which can be expressed in a much 
+smaller range **`0~32`**. But one byte can represent range `0~255`, the 
significant bits are wasted, and this cost is not ignorable. In a dynamic 
serialization
 framework, such meta will take considerable cost compared to real data.
 
-So we proposed a new string encoding algorithm which we called **meta string 
encoding**. It will encode most chars using less bits instead of `8` bits in 
utf-8 encoding.
+So we proposed a new string encoding algorithm which we called **meta string 
encoding** in Fury. It will encode most chars using `5` bits instead of `8` 
bits in utf-8 encoding, which can bring **37.5% space cost savings** compared 
to utf-8 encoding.
 
 ## Meta String Introduction
 
@@ -36,9 +36,10 @@ String binary encoding algorithm:
 | LOWER_SPECIAL             | `a-z._$\|`    | every char is written using 5 
bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101`, prepend one bit at 
the start to indicate whether strip last char since last byte may have 7 
redundant bits(1 indicates strip last char)                                     
                   |
 | LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._` | every char is written using 6 
bits, `a-z`: `0b00000~0b11001`, `A-Z`: `0b11010~0b110011`, `0~9`: 
`0b110100~0b111101`, `._`: `0b111110~0b111111`,  prepend one bit at the start 
to indicate whether strip last char since last byte may have 7 redundant bits(1 
indicates strip last char) |
 | UTF-8                     | any chars     | UTF-8 encoding                   
                                                                                
                                                                                
                                                                                
        |
-If we use `LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL`, we must add a strip last 
char flag in encoded data. This is because every char will be encoded using 
`5/6` bits, and the last char may have `1~7` bits which are unused by encoding, 
such bits may cause an extra char read, which we must strip off.
 
-Encoding code snippet in java, see 
[`org.apache.fury.meta.MetaStringEncoder#encodeGeneric(char[], 
int)`](https://github.com/apache/incubator-fury/blob/93800888595065b2690fec093ab0cbfd6ac7dedc/java/fury-core/src/main/java/org/apache/fury/meta/MetaStringEncoder.java#L235)
 for more detailed:
+If we use `LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL`, we must add a strip last 
char flag in encoded data. This is because every char will be encoded using 
`5/6` bits, and the last char may have `1~7` bits which are unused by encoding, 
such bits may cause an extra char to be read, which we must strip off.
+
+Here is encoding code snippet in java, see 
[`org.apache.fury.meta.MetaStringEncoder#encodeGeneric(char[], 
int)`](https://github.com/apache/incubator-fury/blob/93800888595065b2690fec093ab0cbfd6ac7dedc/java/fury-core/src/main/java/org/apache/fury/meta/MetaStringEncoder.java#L235)
 for more details:
 ```java
 private byte[] encodeGeneric(char[] chars, int bitsPerChar) {
   int totalBits = chars.length * bitsPerChar + 1;
@@ -100,7 +101,7 @@ private int charToValueLowerUpperDigitSpecial(char c) {
 }
 ```
 
-Decoding code snippet in golang, see 
[`go/fury/meta/meta_string_decoder.go:70`](https://github.com/apache/incubator-fury/blob/93800888595065b2690fec093ab0cbfd6ac7dedc/go/fury/meta/meta_string_decoder.go#L70)
 for more details:
+Here is decoding code snippet in golang, see 
[`go/fury/meta/meta_string_decoder.go:70`](https://github.com/apache/incubator-fury/blob/93800888595065b2690fec093ab0cbfd6ac7dedc/go/fury/meta/meta_string_decoder.go#L70)
 for more details:
 ```go
 func (d *Decoder) decodeGeneric(data []byte, algorithm Encoding) ([]byte, 
error) {
        bitsPerChar := 5
@@ -139,8 +140,7 @@ func (d *Decoder) decodeGeneric(data []byte, algorithm 
Encoding) ([]byte, error)
 
 ## Select Best Encoding
 
-For most lower chars, meta string will use `5` bits to encode every char. For 
string containing upper chars, meta string will try to convert the string into a
-lower representation by inserting some markers, and compare used bytes with 
`6` bits encoding, then select the encoding which has smaller encoded size.
+For most lowercase characters, meta string will use `5` bits to encode every 
char. For string containing uppercase chars, meta string will try to convert 
the string into a lower case representation by inserting some markers, and 
compare used bytes with `6` bits encoding, then select the encoding which has 
smaller encoded size.
 
 Here is the common encoding selection strategy:
 
@@ -158,7 +158,7 @@ For package name, module name or namespace, `LOWER_SPECIAL` 
will be used mostly.
 For className, `FIRST_TO_LOWER_SPECIAL` will be used mostly. If there are 
multiple uppercase chars, then `ALL_TO_LOWER_SPECIAL` will be used instead.
 If a string contains digits, then `LOWER_UPPER_DIGIT_SPECIAL` will be used.
 
-Finally, utf8 will be the fallback encoding if the string contains some chars 
not in range `a-z0-9A-Z`.
+Finally, utf8 will be the fallback encoding if the string contains some chars 
which is not in range `a-z0-9A-Z`.
 
 
 ## Encoding Flags and Data jointly


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to