write performance (#2803)

chaokunyang Wed, 22 Oct 2025 01:43:41 -0700

This is an automated email from the ASF dual-hosted git repository.

chaokunyang pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/fory.git



The following commit(s) were added to refs/heads/main by this push:
     new cc934c07c perf(rust): optimize rust small string/struct read/write 
performance (#2803)
cc934c07c is described below

commit cc934c07ca70d78071c0d94f351533a6e69af09e
Author: Shawn Yang <[email protected]>
AuthorDate: Wed Oct 22 16:42:08 2025 +0800

    perf(rust): optimize rust small string/struct read/write performance (#2803)
    
    ## Why?
    
    <!-- Describe the purpose of this PR. -->
    
    ## What does this PR do?
    
    - optimize rust string read/write performance
    - add inline hints to optimize small struct serialize performance
    
    ## Related issues
    
    Closes #2802
    
    ## Does this PR introduce any user-facing change?
    
    <!--
    If any user-facing interface changes, please [open an
    issue](https://github.com/apache/fory/issues/new/choose) describing the
    need to do so and update the document if necessary.
    
    Delete section if not applicable.
    -->
    
    - [ ] Does this PR introduce any public API change?
    - [ ] Does this PR introduce any binary protocol compatibility change?
    
    ## Benchmark
    
    # SimpleStruct Comparison Performance Report
    
    This compares **Fory**, **Protobuf**, and **JSON** across **serialize**
    and **deserialize** for **small**, **medium**, and **large** payloads.
    
    ---
    
    ## 1. Serialization (Time in ns, Lower = Better)
    
    | Size | Fory | Protobuf | JSON | Fastest | Change Summary |
    
    
|----------|-------------|-------------|-------------|-------------|----------------|
    | Small | **125.78** | 187.98 | 225.71 | Fory | Fory ↑, Protobuf ↓, JSON
    ↑ |
    | Medium | **127.99** | 250.21 | 250.61 | Fory | Fory ↑, Protobuf ↓,
    JSON ↓ |
    | Large | **153.31** | 247.91 | 598.14 | Fory | Fory ↑, Protobuf ↑, JSON
    ↓ |
    
    **Note:** “↑” = improved performance (faster), “↓” = regression
    (slower).
    
    ---
    
    ## 2. Deserialization (Time in ns, Lower = Better)
    
    | Size | Fory | Protobuf | JSON | Fastest | Change Summary |
    
    
|----------|--------------|-------------|-------------|-------------|----------------|
    | Small | 163.28 | **100.94** | 247.23 | Protobuf | Fory ↑, Protobuf ↓,
    JSON ↓ |
    | Medium | 175.83 | **93.52** | 271.57 | Protobuf | Fory ↑, Protobuf ↔,
    JSON ↔ |
    | Large | 175.66 | **107.36** | 350.12 | Protobuf | Fory ↑, Protobuf ↔,
    JSON ↔ |
    
    **Note:** “↔” = no significant change.
    
    ---
    
    ## 3. Overall Trends
    
    ### **Fory**
    - **Serialization:** Consistently fastest in all sizes, **huge gains**
    (up to ~80% faster on large payloads).
    - **Deserialization:** Slower than Protobuf but **significant
    improvements** (up to ~46% faster compared to previous run).
    
    ### **Protobuf**
    - **Serialization:** Slower than Fory, **regressed** for small & medium,
    slightly improved for large.
    - **Deserialization:** Fastest in all sizes (especially for small
    payloads), mostly unchanged except small case regressed.
    
    ### **JSON**
    - **Serialization:** Always slowest, small improved, medium & large
    regressed.
    - **Deserialization:** Always slowest, mostly unchanged, small case
    regressed.
    
    ---
    
    ## 4. Key Takeaways
    
    1. **Fory is now clearly the best choice for serialization speed**
    across all payload sizes.
    2. **Protobuf retains the crown for deserialization speed**, especially
    for small and medium payloads.
    3. **JSON remains the slowest** in both serialization and
    deserialization, and showed regression in many cases.
    4. For workloads that serialize often: use **Fory**.
    5. For workloads that deserialize small payloads often: **Protobuf**
    still wins.
    
    # Ecommerce Data Serialization/Deserialization Performance Report
    
    ## Serialize Performance (lower is better)
    
    | Size | Fory Serialize | Protobuf Serialize | JSON Serialize | Fastest
    |
    
    
|--------|--------------------|--------------------|----------------------|---------|
    | Small | **0.935 µs** | 7.37 µs | 9.74 µs | Fory |
    | Medium | **34.86 µs** | 421.91 µs | 485.89 µs | Fory |
    | Large | **665.25 µs** | 10.971 ms | 8.0948 ms | Fory |
    
    ## Deserialize Performance (lower is better)
    
    | Size | Fory Deserialize | Protobuf Deserialize | JSON Deserialize |
    Fastest |
    
    
|--------|--------------------|----------------------|---------------------|---------|
    | Small | **7.6366 µs** | 9.1811 µs | 14.086 µs | Fory |
    | Medium | **404.89 µs** | 606.06 µs | 719.57 µs | Fory |
    | Large | **6.4556 ms** | 10.544 ms | 11.479 ms | Fory |
    
    ---
    
    ## Observations
    
    1. **Fory** outperforms Protobuf and JSON in **all cases**, both
    serialization and deserialization.
    2. Performance gap is especially large for medium and large datasets
    where Fory is:
       - ~12× faster than Protobuf serialization for large data.
       - ~1.7× faster than JSON serialization for large data.
    3. Small dataset serialization for Fory is **extremely** fast (~0.935
    µs) compared to Protobuf (~7.37 µs) and JSON (~9.74 µs).
    
    ---
    
    ## Relative Speedups (Fory vs others)
    
    ### Serialize
    - Small: Fory vs Protobuf → **7.9× faster**
    - Medium: Fory vs Protobuf → **12× faster**
    - Large: Fory vs Protobuf → **16.5× faster**
    - Small: Fory vs JSON → **10.4× faster**
    - Medium: Fory vs JSON → **13.9× faster**
    - Large: Fory vs JSON → **12.2× faster**
    
    ### Deserialize
    - Small: Fory vs Protobuf → **1.20× faster**
    - Medium: Fory vs Protobuf → **1.50× faster**
    - Large: Fory vs Protobuf → **1.63× faster**
    - Small: Fory vs JSON → **1.84× faster**
    - Medium: Fory vs JSON → **1.78× faster**
    - Large: Fory vs JSON → **1.78× faster**
    
    ---
    
    ## Conclusion
    The **Fory** format is consistently the fastest across all dataset sizes
    and both serialization/deserialization.
    For performance‑critical ecommerce data pipelines, replacing Protobuf
    and JSON with Fory could yield **substantial latency reductions**,
    especially in large dataset scenarios.
---
 rust/fory-core/src/buffer.rs                 | 125 +++++++++++++++++++-
 rust/fory-core/src/fory.rs                   |   7 +-
 rust/fory-core/src/meta/string_util.rs       | 164 +++++++++++++++++++--------
 rust/fory-core/src/resolver/ref_resolver.rs  |  14 +++
 rust/fory-core/src/resolver/type_resolver.rs |  21 ++++
 rust/fory-core/src/serializer/core.rs        |   5 +
 rust/fory-core/src/serializer/number.rs      |  18 +--
 rust/fory-core/src/serializer/string.rs      |  21 +++-
 rust/fory-derive/src/object/serializer.rs    |  15 +++
 9 files changed, 327 insertions(+), 63 deletions(-)

diff --git a/rust/fory-core/src/buffer.rs b/rust/fory-core/src/buffer.rs
index d38bf4fcb..d5c00cb91 100644
--- a/rust/fory-core/src/buffer.rs
+++ b/rust/fory-core/src/buffer.rs
@@ -23,6 +23,10 @@ use crate::meta::buffer_rw_string::{
 use byteorder::{ByteOrder, LittleEndian, WriteBytesExt};
 use std::slice;
 
+/// Threshold for using SIMD optimizations in string operations.
+/// For buffers smaller than this, direct copy is faster than SIMD setup 
overhead.
+const SIMD_THRESHOLD: usize = 128;
+
 #[derive(Default)]
 pub struct Writer {
     pub(crate) bf: Vec<u8>,
@@ -325,16 +329,59 @@ impl Writer {
 
     #[inline(always)]
     pub fn write_latin1_string(&mut self, s: &str) {
+        if s.len() < SIMD_THRESHOLD {
+            // Fast path for small buffers
+            let bytes = s.as_bytes();
+            // CRITICAL: Only safe if ASCII (UTF-8 == Latin1 for ASCII)
+            let is_ascii = bytes.iter().all(|&b| b < 0x80);
+            if is_ascii {
+                self.bf.reserve(s.len());
+                self.bf.extend_from_slice(bytes);
+            } else {
+                // Non-ASCII: must iterate chars to extract Latin1 byte values
+                self.bf.reserve(s.len());
+                for c in s.chars() {
+                    let v = c as u32;
+                    assert!(v <= 0xFF, "Non-Latin1 character found");
+                    self.bf.push(v as u8);
+                }
+            }
+            return;
+        }
         write_latin1_simd(self, s);
     }
 
     #[inline(always)]
     pub fn write_utf8_string(&mut self, s: &str) {
-        write_utf8_simd(self, s);
+        let bytes = s.as_bytes();
+        let len = bytes.len();
+
+        if len < SIMD_THRESHOLD {
+            // Fast path for small strings - direct copy avoids SIMD overhead
+            // For small strings, the branch cost + simple copy is faster than 
SIMD setup
+            self.bf.reserve(len);
+            self.bf.extend_from_slice(bytes);
+        } else {
+            // Use SIMD for larger strings where the overhead is amortized
+            write_utf8_simd(self, s);
+        }
     }
 
     #[inline(always)]
     pub fn write_utf16_bytes(&mut self, bytes: &[u16]) {
+        let total_bytes = bytes.len() * 2;
+        if total_bytes < SIMD_THRESHOLD {
+            // Fast path for small UTF-16 data - direct copy
+            let old_len = self.bf.len();
+            self.bf.reserve(total_bytes);
+            unsafe {
+                let dest = self.bf.as_mut_ptr().add(old_len);
+                let src = bytes.as_ptr() as *const u8;
+                std::ptr::copy_nonoverlapping(src, dest, total_bytes);
+                self.bf.set_len(old_len + total_bytes);
+            }
+            return;
+        }
         write_utf16_simd(self, bytes);
     }
 }
@@ -617,18 +664,90 @@ impl Reader {
     #[inline(always)]
     pub fn read_latin1_string(&mut self, len: usize) -> Result<String, Error> {
         self.check_bound(len)?;
-        read_latin1_simd(self, len)
+        if len < SIMD_THRESHOLD {
+            // Fast path for small buffers
+            unsafe {
+                let src = std::slice::from_raw_parts(self.bf.add(self.cursor), 
len);
+
+                // Check if all bytes are ASCII (< 0x80)
+                let is_ascii = src.iter().all(|&b| b < 0x80);
+
+                if is_ascii {
+                    // ASCII fast path: Latin1 == UTF-8, direct copy
+                    let mut vec = Vec::with_capacity(len);
+                    let dst = vec.as_mut_ptr();
+                    std::ptr::copy_nonoverlapping(src.as_ptr(), dst, len);
+                    vec.set_len(len);
+                    self.move_next(len);
+                    Ok(String::from_utf8_unchecked(vec))
+                } else {
+                    // Contains Latin1 bytes (0x80-0xFF): must convert to UTF-8
+                    let mut out: Vec<u8> = Vec::with_capacity(len * 2);
+                    let out_ptr = out.as_mut_ptr();
+                    let mut out_len = 0;
+
+                    for &b in src {
+                        if b < 0x80 {
+                            *out_ptr.add(out_len) = b;
+                            out_len += 1;
+                        } else {
+                            // Latin1 -> UTF-8 encoding
+                            *out_ptr.add(out_len) = 0xC0 | (b >> 6);
+                            *out_ptr.add(out_len + 1) = 0x80 | (b & 0x3F);
+                            out_len += 2;
+                        }
+                    }
+
+                    out.set_len(out_len);
+                    self.move_next(len);
+                    Ok(String::from_utf8_unchecked(out))
+                }
+            }
+        } else {
+            // Use SIMD for larger strings where the overhead is amortized
+            read_latin1_simd(self, len)
+        }
     }
 
     #[inline(always)]
     pub fn read_utf8_string(&mut self, len: usize) -> Result<String, Error> {
         self.check_bound(len)?;
-        read_utf8_simd(self, len)
+
+        if len < SIMD_THRESHOLD {
+            // Fast path for small strings - direct copy avoids SIMD overhead
+            // SAFETY: bounds already checked, assuming valid UTF-8 (caller's 
responsibility)
+            unsafe {
+                let mut vec = Vec::with_capacity(len);
+                let src = self.bf.add(self.cursor);
+                let dst = vec.as_mut_ptr();
+                // Use fastest possible copy - copy_nonoverlapping compiles to 
memcpy
+                std::ptr::copy_nonoverlapping(src, dst, len);
+                vec.set_len(len);
+                self.move_next(len);
+                // SAFETY: Assuming valid UTF-8 bytes (responsibility of 
serialization protocol)
+                Ok(String::from_utf8_unchecked(vec))
+            }
+        } else {
+            // Use SIMD for larger strings where the overhead is amortized
+            read_utf8_simd(self, len)
+        }
     }
 
     #[inline(always)]
     pub fn read_utf16_string(&mut self, len: usize) -> Result<String, Error> {
         self.check_bound(len)?;
+        if len < SIMD_THRESHOLD {
+            // Fast path for small UTF-16 strings - direct copy
+            unsafe {
+                let slice = 
std::slice::from_raw_parts(self.bf.add(self.cursor), len);
+                let units: Vec<u16> = slice
+                    .chunks_exact(2)
+                    .map(|c| u16::from_le_bytes([c[0], c[1]]))
+                    .collect();
+                self.move_next(len);
+                return Ok(String::from_utf16_lossy(&units));
+            }
+        }
         read_utf16_simd(self, len)
     }
 
diff --git a/rust/fory-core/src/fory.rs b/rust/fory-core/src/fory.rs
index 0f7528d78..b00d0e07a 100644
--- a/rust/fory-core/src/fory.rs
+++ b/rust/fory-core/src/fory.rs
@@ -90,7 +90,7 @@ impl Default for Fory {
     fn default() -> Self {
         Fory {
             compatible: false,
-            xlang: true,
+            xlang: false,
             share_meta: false,
             type_resolver: TypeResolver::default(),
             compress_string: false,
@@ -156,7 +156,7 @@ impl Fory {
     ///
     /// # Default
     ///
-    /// The default value is `true`.
+    /// The default value is `false`.
     ///
     /// # Examples
     ///
@@ -166,7 +166,8 @@ impl Fory {
     /// // For cross-language use (default)
     /// let fory = Fory::default().xlang(true);
     ///
-    /// // For Rust-only optimization
+    /// // For Rust-only optimization, this mode is faster and more compact 
since it avoids
+    /// // cross-language metadata and type system costs.
     /// let fory = Fory::default().xlang(false);
     /// ```
     pub fn xlang(mut self, xlang: bool) -> Self {
diff --git a/rust/fory-core/src/meta/string_util.rs 
b/rust/fory-core/src/meta/string_util.rs
index 4b4692d62..8eda15d57 100644
--- a/rust/fory-core/src/meta/string_util.rs
+++ b/rust/fory-core/src/meta/string_util.rs
@@ -602,12 +602,10 @@ pub mod buffer_rw_string {
     #[inline]
     fn write_bytes_simd(writer: &mut Writer, bytes: &[u8]) {
         let len = bytes.len();
-        let mut i = 0usize;
-
         if len == 0 {
             return;
         }
-
+        let mut i = 0usize;
         writer.bf.reserve(len);
 
         #[cfg(any(
@@ -685,21 +683,83 @@ pub mod buffer_rw_string {
         }
     }
 
+    #[inline]
+    fn is_ascii_bytes(bytes: &[u8]) -> bool {
+        let len = bytes.len();
+        let mut i = 0;
+
+        #[cfg(target_arch = "x86_64")]
+        unsafe {
+            if is_x86_feature_detected!("avx2") && len >= 32 {
+                while i + 32 <= len {
+                    let chunk = _mm256_loadu_si256(bytes.as_ptr().add(i) as 
*const __m256i);
+                    let mask = _mm256_movemask_epi8(chunk);
+                    if mask != 0 {
+                        return false;
+                    }
+                    i += 32;
+                }
+            }
+        }
+
+        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
+        unsafe {
+            if is_x86_feature_detected!("sse2") && len >= 16 {
+                while i + 16 <= len {
+                    let chunk = _mm_loadu_si128(bytes.as_ptr().add(i) as 
*const __m128i);
+                    let mask = _mm_movemask_epi8(chunk);
+                    if mask != 0 {
+                        return false;
+                    }
+                    i += 16;
+                }
+            }
+        }
+
+        #[cfg(target_arch = "aarch64")]
+        unsafe {
+            if std::arch::is_aarch64_feature_detected!("neon") && len >= 16 {
+                while i + 16 <= len {
+                    let chunk = vld1q_u8(bytes.as_ptr().add(i));
+                    if vmaxvq_u8(chunk) >= 0x80 {
+                        return false;
+                    }
+                    i += 16;
+                }
+            }
+        }
+
+        // Scalar fallback
+        bytes[i..].iter().all(|&b| b < 0x80)
+    }
+
     #[inline]
     pub fn write_latin1_simd(writer: &mut Writer, s: &str) {
         if s.is_empty() {
             return;
         }
-        let mut buf: Vec<u8> = Vec::with_capacity(s.len());
-        for c in s.chars() {
-            let v = c as u32;
-            assert!(v <= 0xFF, "Non-Latin1 character found");
-            buf.push(v as u8);
+
+        let bytes = s.as_bytes();
+
+        // CRITICAL OPTIMIZATION: For ASCII strings, UTF-8 bytes == Latin1 
bytes
+        // Check if all ASCII using SIMD
+        if is_ascii_bytes(bytes) {
+            // Zero-copy fast path: direct write
+            write_bytes_simd(writer, bytes);
+        } else {
+            // Non-ASCII: Must iterate chars to extract Latin1 byte values
+            // Example: 'À' in Rust String is UTF-8 [0xC3, 0x80] but Latin1 is 
[0xC0]
+            let mut buf: Vec<u8> = Vec::with_capacity(s.len());
+            for c in s.chars() {
+                let v = c as u32;
+                assert!(v <= 0xFF, "Non-Latin1 character found");
+                buf.push(v as u8);
+            }
+            write_bytes_simd(writer, &buf);
         }
-        write_bytes_simd(writer, &buf);
     }
 
-    #[inline]
+    #[inline(always)]
     pub fn write_utf8_simd(writer: &mut Writer, s: &str) {
         let bytes = s.as_bytes();
         write_bytes_simd(writer, bytes);
@@ -776,12 +836,15 @@ pub mod buffer_rw_string {
         }
         let src = unsafe { 
std::slice::from_raw_parts(reader.bf.add(reader.cursor), len) };
 
-        let mut out: Vec<u8> = Vec::with_capacity(len + len / 4);
+        // Pessimistic allocation: Latin1 0x80-0xFF expands to 2 bytes in UTF-8
+        let mut out: Vec<u8> = Vec::with_capacity(len * 2);
 
         unsafe {
+            let out_ptr = out.as_mut_ptr();
+            let mut out_len = 0usize;
             let mut i = 0usize;
 
-            // ---- AVX2 fast-path (32 bytes) ----
+            // ---- AVX2 fast-path: process 32 ASCII bytes at once ----
             #[cfg(target_arch = "x86_64")]
             {
                 if std::arch::is_x86_feature_detected!("avx2") {
@@ -791,19 +854,20 @@ pub mod buffer_rw_string {
                         let chunk = _mm256_loadu_si256(ptr);
                         let mask = _mm256_movemask_epi8(chunk);
                         if mask == 0 {
-                            let mut buf32: [u8; 32] = std::mem::zeroed();
-                            _mm256_storeu_si256(buf32.as_mut_ptr() as *mut 
__m256i, chunk);
-                            out.extend_from_slice(&buf32);
+                            // All ASCII: direct copy (no conversion needed)
+                            _mm256_storeu_si256(out_ptr.add(out_len) as *mut 
__m256i, chunk);
+                            out_len += 32;
                             i += 32;
                             continue;
                         } else {
+                            // Contains Latin1 bytes, break to scalar
                             break;
                         }
                     }
                 }
             }
 
-            // ---- SSE2 fast-path (16 bytes) ----
+            // ---- SSE2 fast-path: process 16 ASCII bytes at once ----
             #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
             {
                 if std::arch::is_x86_feature_detected!("sse2") {
@@ -813,9 +877,9 @@ pub mod buffer_rw_string {
                         let chunk = _mm_loadu_si128(ptr);
                         let mask = _mm_movemask_epi8(chunk);
                         if mask == 0 {
-                            let mut buf16: [u8; 16] = std::mem::zeroed();
-                            _mm_storeu_si128(buf16.as_mut_ptr() as *mut 
__m128i, chunk);
-                            out.extend_from_slice(&buf16);
+                            // All ASCII: direct copy
+                            _mm_storeu_si128(out_ptr.add(out_len) as *mut 
__m128i, chunk);
+                            out_len += 16;
                             i += 16;
                             continue;
                         } else {
@@ -825,7 +889,7 @@ pub mod buffer_rw_string {
                 }
             }
 
-            // ---- NEON fast-path (16 bytes) ----
+            // ---- NEON fast-path: process 16 ASCII bytes at once ----
             #[cfg(target_arch = "aarch64")]
             {
                 if std::arch::is_aarch64_feature_detected!("neon") {
@@ -833,15 +897,11 @@ pub mod buffer_rw_string {
                     while i + 16 <= len {
                         let ptr = src.as_ptr().add(i);
                         let v = vld1q_u8(ptr);
-                        let cmp = vcgeq_u8(v, vdupq_n_u8(128));
-
-                        let mut mask_arr: [u8; 16] = std::mem::zeroed();
-                        vst1q_u8(mask_arr.as_mut_ptr(), cmp);
-
-                        if mask_arr.iter().all(|&x| x == 0) {
-                            let mut buf16: [u8; 16] = std::mem::zeroed();
-                            vst1q_u8(buf16.as_mut_ptr(), v);
-                            out.extend_from_slice(&buf16);
+                        // Check if any byte >= 0x80
+                        if vmaxvq_u8(v) < 0x80 {
+                            // All ASCII: direct copy
+                            vst1q_u8(out_ptr.add(out_len), v);
+                            out_len += 16;
                             i += 16;
                             continue;
                         } else {
@@ -851,17 +911,25 @@ pub mod buffer_rw_string {
                 }
             }
 
-            // ---- scalar fallback for remaining bytes ----
+            // ---- Scalar fallback: convert Latin1 -> UTF-8 ----
+            // ASCII (0x00-0x7F): copy as-is
+            // Latin1 (0x80-0xFF): encode as 2-byte UTF-8
             while i < len {
                 let b = *src.get_unchecked(i);
                 if b < 0x80 {
-                    out.push(b);
+                    *out_ptr.add(out_len) = b;
+                    out_len += 1;
                 } else {
-                    out.push(0xC0 | (b >> 6));
-                    out.push(0x80 | (b & 0x3F));
+                    // Latin1 byte 0x80-0xFF -> UTF-8 encoding
+                    // Example: 0xC0 (À) -> [0xC3, 0x80]
+                    *out_ptr.add(out_len) = 0xC0 | (b >> 6);
+                    *out_ptr.add(out_len + 1) = 0x80 | (b & 0x3F);
+                    out_len += 2;
                 }
                 i += 1;
             }
+
+            out.set_len(out_len);
         }
         reader.move_next(len);
         Ok(unsafe { String::from_utf8_unchecked(out) })
@@ -872,25 +940,28 @@ pub mod buffer_rw_string {
         if len == 0 {
             return Ok(String::new());
         }
-
         let src = unsafe { 
std::slice::from_raw_parts(reader.bf.add(reader.cursor), len) };
-        let mut result = String::with_capacity(len);
+
+        // CRITICAL OPTIMIZATION: Allocate Vec once, SIMD copy directly, 
single String construction
+        // Eliminates multiple push_str copies
+        let mut vec = Vec::with_capacity(len);
 
         unsafe {
+            let dst: *mut u8 = vec.as_mut_ptr();
             let mut i = 0usize;
 
+            // ---- AVX2 path: 32-byte chunks ----
             #[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
             {
                 const CHUNK: usize = 32;
                 while i + CHUNK <= len {
                     let chunk = _mm256_loadu_si256(src.as_ptr().add(i) as 
*const __m256i);
-                    let mut buf = [0u8; CHUNK];
-                    _mm256_storeu_si256(buf.as_mut_ptr() as *mut __m256i, 
chunk);
-                    result.push_str(std::str::from_utf8_unchecked(&buf));
+                    _mm256_storeu_si256(dst.add(i) as *mut __m256i, chunk);
                     i += CHUNK;
                 }
             }
 
+            // ---- SSE2 path: 16-byte chunks ----
             #[cfg(all(
                 any(target_arch = "x86", target_arch = "x86_64"),
                 target_feature = "sse2",
@@ -900,32 +971,33 @@ pub mod buffer_rw_string {
                 const CHUNK: usize = 16;
                 while i + CHUNK <= len {
                     let chunk = _mm_loadu_si128(src.as_ptr().add(i) as *const 
__m128i);
-                    let mut buf = [0u8; CHUNK];
-                    _mm_storeu_si128(buf.as_mut_ptr() as *mut __m128i, chunk);
-                    result.push_str(std::str::from_utf8_unchecked(&buf));
+                    _mm_storeu_si128(dst.add(i) as *mut __m128i, chunk);
                     i += CHUNK;
                 }
             }
 
+            // ---- NEON path: 16-byte chunks ----
             #[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
             {
                 const CHUNK: usize = 16;
                 while i + CHUNK <= len {
                     let chunk = vld1q_u8(src.as_ptr().add(i));
-                    let mut buf = [0u8; CHUNK];
-                    vst1q_u8(buf.as_mut_ptr(), chunk);
-                    result.push_str(std::str::from_utf8_unchecked(&buf));
+                    vst1q_u8(dst.add(i), chunk);
                     i += CHUNK;
                 }
             }
 
+            // ---- Copy remaining bytes ----
             if i < len {
-                result.push_str(std::str::from_utf8_unchecked(&src[i..len]));
+                std::ptr::copy_nonoverlapping(src.as_ptr().add(i), dst.add(i), 
len - i);
             }
+
+            vec.set_len(len);
         }
 
         reader.move_next(len);
-        Ok(result)
+        // Single String construction - no intermediate copies!
+        Ok(unsafe { String::from_utf8_unchecked(vec) })
     }
 
     #[inline]
diff --git a/rust/fory-core/src/resolver/ref_resolver.rs 
b/rust/fory-core/src/resolver/ref_resolver.rs
index 84eaa4b26..951f640ef 100644
--- a/rust/fory-core/src/resolver/ref_resolver.rs
+++ b/rust/fory-core/src/resolver/ref_resolver.rs
@@ -79,6 +79,7 @@ impl RefWriter {
     ///
     /// * `true` if a reference was written
     /// * `false` if this is the first occurrence of the object
+    #[inline]
     pub fn try_write_rc_ref<T: ?Sized>(&mut self, writer: &mut Writer, rc: 
&Rc<T>) -> bool {
         let ptr_addr = Rc::as_ptr(rc) as *const () as usize;
 
@@ -110,6 +111,7 @@ impl RefWriter {
     ///
     /// * `true` if a reference was written
     /// * `false` if this is the first occurrence of the object
+    #[inline]
     pub fn try_write_arc_ref<T: ?Sized>(&mut self, writer: &mut Writer, arc: 
&Arc<T>) -> bool {
         let ptr_addr = Arc::as_ptr(arc) as *const () as usize;
 
@@ -131,6 +133,7 @@ impl RefWriter {
     /// Clear all stored references.
     ///
     /// This is useful for reusing the RefWriter for multiple serialization 
operations.
+    #[inline(always)]
     pub fn reset(&mut self) {
         self.refs.clear();
         self.next_ref_id = 0;
@@ -181,6 +184,7 @@ impl RefReader {
     /// Reserve a reference ID slot without storing anything yet.
     ///
     /// Returns the reserved reference ID that will be used when storing the 
object later.
+    #[inline(always)]
     pub fn reserve_ref_id(&mut self) -> u32 {
         let ref_id = self.refs.len() as u32;
         self.refs.push(Box::new(()));
@@ -193,6 +197,7 @@ impl RefReader {
     ///
     /// * `ref_id` - The reference ID that was reserved
     /// * `rc` - The Rc to store
+    #[inline(always)]
     pub fn store_rc_ref_at<T: 'static + ?Sized>(&mut self, ref_id: u32, rc: 
Rc<T>) {
         self.refs[ref_id as usize] = Box::new(rc);
     }
@@ -206,6 +211,7 @@ impl RefReader {
     /// # Returns
     ///
     /// The reference ID that can be used to retrieve this object later
+    #[inline(always)]
     pub fn store_rc_ref<T: 'static + ?Sized>(&mut self, rc: Rc<T>) -> u32 {
         let ref_id = self.refs.len() as u32;
         self.refs.push(Box::new(rc));
@@ -231,6 +237,7 @@ impl RefReader {
     /// # Returns
     ///
     /// The reference ID that can be used to retrieve this object later
+    #[inline(always)]
     pub fn store_arc_ref<T: 'static + ?Sized>(&mut self, arc: Arc<T>) -> u32 {
         let ref_id = self.refs.len() as u32;
         self.refs.push(Box::new(arc));
@@ -247,6 +254,7 @@ impl RefReader {
     ///
     /// * `Some(Rc<T>)` if the reference ID is valid and the type matches
     /// * `None` if the reference ID is invalid or the type doesn't match
+    #[inline(always)]
     pub fn get_rc_ref<T: 'static + ?Sized>(&self, ref_id: u32) -> 
Option<Rc<T>> {
         let any_box = self.refs.get(ref_id as usize)?;
         any_box.downcast_ref::<Rc<T>>().cloned()
@@ -262,6 +270,7 @@ impl RefReader {
     ///
     /// * `Some(Arc<T>)` if the reference ID is valid and the type matches
     /// * `None` if the reference ID is invalid or the type doesn't match
+    #[inline(always)]
     pub fn get_arc_ref<T: 'static + ?Sized>(&self, ref_id: u32) -> 
Option<Arc<T>> {
         let any_box = self.refs.get(ref_id as usize)?;
         any_box.downcast_ref::<Arc<T>>().cloned()
@@ -272,6 +281,7 @@ impl RefReader {
     /// # Arguments
     ///
     /// * `callback` - A closure that takes a reference to the RefReader
+    #[inline(always)]
     pub fn add_callback(&mut self, callback: UpdateCallback) {
         self.callbacks.push(callback);
     }
@@ -289,6 +299,7 @@ impl RefReader {
     /// # Errors
     ///
     /// Errors if an invalid reference flag value is encountered
+    #[inline(always)]
     pub fn read_ref_flag(&self, reader: &mut Reader) -> Result<RefFlag, Error> 
{
         let flag_value = reader.read_i8()?;
         Ok(match flag_value {
@@ -312,6 +323,7 @@ impl RefReader {
     /// # Returns
     ///
     /// The reference ID as a u32
+    #[inline(always)]
     pub fn read_ref_id(&self, reader: &mut Reader) -> Result<u32, Error> {
         reader.read_varuint32()
     }
@@ -320,6 +332,7 @@ impl RefReader {
     ///
     /// This should be called after deserialization completes to update any 
weak pointers
     /// that referenced objects which were not yet available during 
deserialization.
+    #[inline(always)]
     pub fn resolve_callbacks(&mut self) {
         let callbacks = std::mem::take(&mut self.callbacks);
         for callback in callbacks {
@@ -330,6 +343,7 @@ impl RefReader {
     /// Clear all stored references and callbacks.
     ///
     /// This is useful for reusing the RefReader for multiple deserialization 
operations.
+    #[inline(always)]
     pub fn reset(&mut self) {
         self.resolve_callbacks();
         self.refs.clear();
diff --git a/rust/fory-core/src/resolver/type_resolver.rs 
b/rust/fory-core/src/resolver/type_resolver.rs
index c527080c2..3f39aa228 100644
--- a/rust/fory-core/src/resolver/type_resolver.rs
+++ b/rust/fory-core/src/resolver/type_resolver.rs
@@ -69,22 +69,27 @@ impl Harness {
         }
     }
 
+    #[inline(always)]
     pub fn get_write_fn(&self) -> WriteFn {
         self.write_fn
     }
 
+    #[inline(always)]
     pub fn get_read_fn(&self) -> ReadFn {
         self.read_fn
     }
 
+    #[inline(always)]
     pub fn get_write_data_fn(&self) -> WriteDataFn {
         self.write_data_fn
     }
 
+    #[inline(always)]
     pub fn get_read_data_fn(&self) -> ReadDataFn {
         self.read_data_fn
     }
 
+    #[inline(always)]
     pub fn get_to_serializer(&self) -> ToSerializerFn {
         self.to_serializer
     }
@@ -186,30 +191,37 @@ impl TypeInfo {
         })
     }
 
+    #[inline(always)]
     pub fn get_type_id(&self) -> u32 {
         self.type_id
     }
 
+    #[inline(always)]
     pub fn get_namespace(&self) -> Rc<MetaString> {
         self.namespace.clone()
     }
 
+    #[inline(always)]
     pub fn get_type_name(&self) -> Rc<MetaString> {
         self.type_name.clone()
     }
 
+    #[inline(always)]
     pub fn get_type_def(&self) -> Rc<Vec<u8>> {
         self.type_def.clone()
     }
 
+    #[inline(always)]
     pub fn get_type_meta(&self) -> Rc<TypeMeta> {
         self.type_meta.clone()
     }
 
+    #[inline(always)]
     pub fn is_registered_by_name(&self) -> bool {
         self.register_by_name
     }
 
+    #[inline(always)]
     pub fn get_harness(&self) -> &Harness {
         &self.harness
     }
@@ -335,16 +347,19 @@ impl TypeResolver {
             .cloned()
     }
 
+    #[inline(always)]
     pub fn get_type_info_by_id(&self, id: u32) -> Option<Rc<TypeInfo>> {
         self.type_info_map_by_id.get(&id).cloned()
     }
 
+    #[inline(always)]
     pub fn get_type_info_by_name(&self, namespace: &str, type_name: &str) -> 
Option<Rc<TypeInfo>> {
         self.type_info_map_by_name
             .get(&(namespace.to_owned(), type_name.to_owned()))
             .cloned()
     }
 
+    #[inline(always)]
     pub fn get_type_info_by_msname(
         &self,
         namespace: Rc<MetaString>,
@@ -356,6 +371,7 @@ impl TypeResolver {
     }
 
     /// Fast path for getting type info by numeric ID (avoids HashMap lookup 
by TypeId)
+    #[inline(always)]
     pub fn get_type_id(&self, type_id: &std::any::TypeId, id: u32) -> 
Result<u32, Error> {
         let id_usize = id as usize;
         if id_usize < self.type_id_index.len() {
@@ -370,12 +386,14 @@ impl TypeResolver {
         )))
     }
 
+    #[inline(always)]
     pub fn get_harness(&self, id: u32) -> Option<Rc<Harness>> {
         self.type_info_map_by_id
             .get(&id)
             .map(|info| Rc::new(info.get_harness().clone()))
     }
 
+    #[inline(always)]
     pub fn get_name_harness(
         &self,
         namespace: Rc<MetaString>,
@@ -387,6 +405,7 @@ impl TypeResolver {
             .map(|info| Rc::new(info.get_harness().clone()))
     }
 
+    #[inline(always)]
     pub fn get_ext_harness(&self, id: u32) -> Result<Rc<Harness>, Error> {
         self.type_info_map_by_id
             .get(&id)
@@ -394,6 +413,7 @@ impl TypeResolver {
             .ok_or_else(|| Error::type_error("ext type must be registered in 
both peers"))
     }
 
+    #[inline(always)]
     pub fn get_ext_name_harness(
         &self,
         namespace: Rc<MetaString>,
@@ -406,6 +426,7 @@ impl TypeResolver {
             .ok_or_else(|| Error::type_error("named_ext type must be 
registered in both peers"))
     }
 
+    #[inline(always)]
     pub fn get_fory_type_id(&self, rust_type_id: std::any::TypeId) -> 
Option<u32> {
         self.type_info_map
             .get(&rust_type_id)
diff --git a/rust/fory-core/src/serializer/core.rs 
b/rust/fory-core/src/serializer/core.rs
index 1ab796b5e..72e329afb 100644
--- a/rust/fory-core/src/serializer/core.rs
+++ b/rust/fory-core/src/serializer/core.rs
@@ -246,6 +246,7 @@ pub trait Serializer: 'static {
     /// [`fory_write_data`]: Serializer::fory_write_data
     /// [`fory_write_type_info`]: Serializer::fory_write_type_info
     /// [`fory_write_data_generic`]: Serializer::fory_write_data_generic
+    #[inline(always)]
     fn fory_write(
         &self,
         context: &mut WriteContext,
@@ -304,6 +305,7 @@ pub trait Serializer: 'static {
     /// - Focus on implementing [`fory_write_data`] for custom types
     ///
     /// [`fory_write_data`]: Serializer::fory_write_data
+    #[inline(always)]
     #[allow(unused_variables)]
     fn fory_write_data_generic(
         &self,
@@ -574,6 +576,7 @@ pub trait Serializer: 'static {
     /// [`fory_read_data`]: Serializer::fory_read_data
     /// [`fory_read_type_info`]: Serializer::fory_read_type_info
     /// [`fory_write`]: Serializer::fory_write
+    #[inline(always)]
     fn fory_read(
         context: &mut ReadContext,
         read_ref_info: bool,
@@ -658,6 +661,7 @@ pub trait Serializer: 'static {
     /// - User types with custom serialization rarely need to override this
     ///
     /// [`fory_read`]: Serializer::fory_read
+    #[inline(always)]
     #[allow(unused_variables)]
     fn fory_read_with_type_info(
         context: &mut ReadContext,
@@ -1363,6 +1367,7 @@ pub trait StructSerializer: Serializer + 'static {
     /// - Default delegates to `struct_::actual_type_id`
     /// - Handles type ID transformations for compatibility
     /// - **Do not override** for user types with custom serialization (EXT 
types)
+    #[inline(always)]
     fn fory_actual_type_id(type_id: u32, register_by_name: bool, compatible: 
bool) -> u32 {
         struct_::actual_type_id(type_id, register_by_name, compatible)
     }
diff --git a/rust/fory-core/src/serializer/number.rs 
b/rust/fory-core/src/serializer/number.rs
index f5d53754d..4e6a45b0c 100644
--- a/rust/fory-core/src/serializer/number.rs
+++ b/rust/fory-core/src/serializer/number.rs
@@ -27,53 +27,55 @@ use crate::types::TypeId;
 macro_rules! impl_num_serializer {
     ($ty:ty, $writer:expr, $reader:expr, $field_type:expr) => {
         impl Serializer for $ty {
-            #[inline]
+            #[inline(always)]
             fn fory_write_data(&self, context: &mut WriteContext) -> 
Result<(), Error> {
                 $writer(&mut context.writer, *self);
                 Ok(())
             }
 
-            #[inline]
+            #[inline(always)]
             fn fory_read_data(context: &mut ReadContext) -> Result<Self, 
Error> {
                 $reader(&mut context.reader)
             }
 
-            #[inline]
+            #[inline(always)]
             fn fory_reserved_space() -> usize {
                 std::mem::size_of::<$ty>()
             }
 
-            #[inline]
+            #[inline(always)]
             fn fory_get_type_id(_: &TypeResolver) -> Result<u32, Error> {
                 Ok($field_type as u32)
             }
 
+            #[inline(always)]
             fn fory_type_id_dyn(&self, _: &TypeResolver) -> Result<u32, Error> 
{
                 Ok($field_type as u32)
             }
 
+            #[inline(always)]
             fn fory_static_type_id() -> TypeId {
                 $field_type
             }
 
-            #[inline]
+            #[inline(always)]
             fn as_any(&self) -> &dyn std::any::Any {
                 self
             }
 
-            #[inline]
+            #[inline(always)]
             fn fory_write_type_info(context: &mut WriteContext) -> Result<(), 
Error> {
                 context.writer.write_varuint32($field_type as u32);
                 Ok(())
             }
 
-            #[inline]
+            #[inline(always)]
             fn fory_read_type_info(context: &mut ReadContext) -> Result<(), 
Error> {
                 read_basic_type_info::<Self>(context)
             }
         }
         impl ForyDefault for $ty {
-            #[inline]
+            #[inline(always)]
             fn fory_default() -> Self {
                 0 as $ty
             }
diff --git a/rust/fory-core/src/serializer/string.rs 
b/rust/fory-core/src/serializer/string.rs
index d6b7c2fa6..be457cbfb 100644
--- a/rust/fory-core/src/serializer/string.rs
+++ b/rust/fory-core/src/serializer/string.rs
@@ -32,8 +32,16 @@ enum StrEncoding {
 }
 
 impl Serializer for String {
-    #[inline]
+    #[inline(always)]
     fn fory_write_data(&self, context: &mut WriteContext) -> Result<(), Error> 
{
+        if !context.is_xlang() {
+            // Fast path: non-xlang mode always uses UTF-8 without encoding 
header
+            context.writer.write_varuint32(self.len() as u32);
+            context.writer.write_utf8_string(self);
+            return Ok(());
+        }
+
+        // xlang mode: use encoding header for optimal format selection
         let mut len = get_latin1_length(self);
         if len >= 0 {
             let bitor = (len as u64) << 2 | StrEncoding::Latin1 as u64;
@@ -54,8 +62,15 @@ impl Serializer for String {
         Ok(())
     }
 
-    #[inline]
+    #[inline(always)]
     fn fory_read_data(context: &mut ReadContext) -> Result<Self, Error> {
+        if !context.is_xlang() {
+            // Fast path: non-xlang mode always uses UTF-8 without encoding 
header
+            let len = context.reader.read_varuint32()? as usize;
+            return context.reader.read_utf8_string(len);
+        }
+
+        // xlang mode: read encoding header and decode accordingly
         let bitor = context.reader.read_varuint36small()?;
         let len = bitor >> 2;
         let encoding = bitor & 0b11;
@@ -78,7 +93,7 @@ impl Serializer for String {
         Ok(s)
     }
 
-    #[inline]
+    #[inline(always)]
     fn fory_reserved_space() -> usize {
         mem::size_of::<i32>()
     }
diff --git a/rust/fory-derive/src/object/serializer.rs 
b/rust/fory-derive/src/object/serializer.rs
index 20893654c..29ed1b760 100644
--- a/rust/fory-derive/src/object/serializer.rs
+++ b/rust/fory-derive/src/object/serializer.rs
@@ -125,10 +125,12 @@ pub fn derive_serializer(ast: &syn::DeriveInput, 
debug_enabled: bool) -> TokenSt
         #default_impl
 
         impl fory_core::StructSerializer for #name {
+            #[inline(always)]
             fn fory_type_index() -> u32 {
                 #type_idx
             }
 
+            #[inline(always)]
             fn fory_actual_type_id(type_id: u32, register_by_name: bool, 
compatible: bool) -> u32 {
                 #actual_type_id_ts
             }
@@ -141,24 +143,29 @@ pub fn derive_serializer(ast: &syn::DeriveInput, 
debug_enabled: bool) -> TokenSt
                 #fields_info_ts
             }
 
+            #[inline]
             fn fory_read_compatible(context: &mut 
fory_core::resolver::context::ReadContext, type_info: 
std::rc::Rc<fory_core::TypeInfo>) -> Result<Self, fory_core::error::Error> {
                 #read_compatible_ts
             }
         }
 
         impl fory_core::Serializer for #name {
+            #[inline(always)]
             fn fory_get_type_id(type_resolver: 
&fory_core::resolver::type_resolver::TypeResolver) -> Result<u32, 
fory_core::error::Error> {
                 type_resolver.get_type_id(&std::any::TypeId::of::<Self>(), 
#type_idx)
             }
 
+            #[inline(always)]
             fn fory_type_id_dyn(&self, type_resolver: 
&fory_core::resolver::type_resolver::TypeResolver) -> Result<u32, 
fory_core::error::Error> {
                 Self::fory_get_type_id(type_resolver)
             }
 
+            #[inline(always)]
             fn as_any(&self) -> &dyn std::any::Any {
                 self
             }
 
+            #[inline(always)]
             fn fory_static_type_id() -> fory_core::TypeId
             where
                 Self: Sized,
@@ -166,34 +173,42 @@ pub fn derive_serializer(ast: &syn::DeriveInput, 
debug_enabled: bool) -> TokenSt
                 #static_type_id_ts
             }
 
+            #[inline(always)]
             fn fory_reserved_space() -> usize {
                 #reserved_space_ts
             }
 
+            #[inline(always)]
             fn fory_write(&self, context: &mut 
fory_core::resolver::context::WriteContext, write_ref_info: bool, 
write_type_info: bool, _: bool) -> Result<(), fory_core::error::Error> {
                 #write_ts
             }
 
+            #[inline]
             fn fory_write_data(&self, context: &mut 
fory_core::resolver::context::WriteContext) -> Result<(), 
fory_core::error::Error> {
                 #write_data_ts
             }
 
+            #[inline(always)]
             fn fory_write_type_info(context: &mut 
fory_core::resolver::context::WriteContext) -> Result<(), 
fory_core::error::Error> {
                 #write_type_info_ts
             }
 
+            #[inline(always)]
             fn fory_read(context: &mut 
fory_core::resolver::context::ReadContext, read_ref_info: bool, read_type_info: 
bool) -> Result<Self, fory_core::error::Error> {
                 #read_ts
             }
 
+            #[inline(always)]
             fn fory_read_with_type_info(context: &mut 
fory_core::resolver::context::ReadContext, read_ref_info: bool, type_info: 
std::rc::Rc<fory_core::TypeInfo>) -> Result<Self, fory_core::error::Error> {
                 #read_with_type_info_ts
             }
 
+            #[inline]
             fn fory_read_data( context: &mut 
fory_core::resolver::context::ReadContext) -> Result<Self, 
fory_core::error::Error> {
                 #read_data_ts
             }
 
+            #[inline(always)]
             fn fory_read_type_info(context: &mut 
fory_core::resolver::context::ReadContext) -> Result<(), 
fory_core::error::Error> {
                 #read_type_info_ts
             }


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

(fory) branch main updated: perf(rust): optimize rust small string/struct read/write performance (#2803)

Reply via email to