[PR] feat(c++): Support the UTF-8 to UTF-16 with SIMD [fury]

via GitHub Wed, 25 Dec 2024 06:26:25 -0800


pandalee99 opened a new pull request, #1990:
URL: https://github.com/apache/fury/pull/1990


   
   ## What does this PR do?
   Adds UTF-8 to UTF-16 conversion and uses SIMD-style optimizations to speed it up.
   ``` c++
   std::string utf16ToUtf8(const std::u16string &utf16, bool is_little_endian)
   ```
   
   The logic of converting UTF-8 to UTF-16 isn't that complicated, but there are still lots of optimizations that I haven't come up with yet.
   
   So, I'll first design a version that's a bit faster than the original one, and then think about how to make further optimizations.
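   
   For readers who want the conversion logic spelled out, here is a minimal scalar sketch. It is not the code in this PR: the name `utf8ToUtf16Sketch` is hypothetical, it assumes well-formed input (it does not reject overlong forms or check continuation bytes), and it emits native-endian code units rather than taking an `is_little_endian` flag.
   
   ```cpp
   #include <cstddef>
   #include <cstdint>
   #include <string>

   // Hypothetical helper (not from the PR): decodes UTF-8 into native-endian
   // UTF-16. Stops at the first invalid lead byte or truncated sequence.
   std::u16string utf8ToUtf16Sketch(const std::string &utf8) {
     std::u16string out;
     out.reserve(utf8.size());
     const std::size_t n = utf8.size();
     std::size_t i = 0;
     while (i < n) {
       const auto b0 = static_cast<std::uint8_t>(utf8[i]);
       std::uint32_t cp = 0;
       std::size_t len = 0;
       // The lead byte determines how many bytes the character occupies.
       if (b0 < 0x80) {
         cp = b0;
         len = 1;
       } else if ((b0 & 0xE0) == 0xC0) {
         cp = b0 & 0x1F;
         len = 2;
       } else if ((b0 & 0xF0) == 0xE0) {
         cp = b0 & 0x0F;
         len = 3;
       } else if ((b0 & 0xF8) == 0xF0) {
         cp = b0 & 0x07;
         len = 4;
       } else {
         break; // invalid lead byte
       }
       if (i + len > n) {
         break; // truncated sequence
       }
       // Each continuation byte contributes its low 6 bits.
       for (std::size_t k = 1; k < len; ++k) {
         cp = (cp << 6) | (static_cast<std::uint8_t>(utf8[i + k]) & 0x3F);
       }
       i += len;
       if (cp < 0x10000) {
         out.push_back(static_cast<char16_t>(cp));
       } else {
         // Code points above U+FFFF become a surrogate pair.
         cp -= 0x10000;
         out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
         out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
       }
     }
     return out;
   }
   ```
   
   The branch on the lead byte is the part that makes each iteration depend on the previous one, which is why a faster version has to find ways around it.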
   
   Judging from the tests, the logic is correct:
   ``` text 
   [----------] 9 tests from UTF8ToUTF16Test
   [ RUN      ] UTF8ToUTF16Test.BasicConversion
   [       OK ] UTF8ToUTF16Test.BasicConversion (0 ms)
   [ RUN      ] UTF8ToUTF16Test.EmptyString
   [       OK ] UTF8ToUTF16Test.EmptyString (0 ms)
   [ RUN      ] UTF8ToUTF16Test.SurrogatePairs
   [       OK ] UTF8ToUTF16Test.SurrogatePairs (0 ms)
   [ RUN      ] UTF8ToUTF16Test.BoundaryValues
   [       OK ] UTF8ToUTF16Test.BoundaryValues (0 ms)
   [ RUN      ] UTF8ToUTF16Test.SpecialCharacters
   [       OK ] UTF8ToUTF16Test.SpecialCharacters (0 ms)
   [ RUN      ] UTF8ToUTF16Test.LittleEndian
   [       OK ] UTF8ToUTF16Test.LittleEndian (0 ms)
   [ RUN      ] UTF8ToUTF16Test.BigEndian
   [       OK ] UTF8ToUTF16Test.BigEndian (0 ms)
   [ RUN      ] UTF8ToUTF16Test.RoundTripConversion
   [       OK ] UTF8ToUTF16Test.RoundTripConversion (0 ms)
   ```
   <img width="264" alt="image" src="https://github.com/user-attachments/assets/7b9033ad-001f-4a36-a27e-6a8362f3a6df" />
   
   
   
   In terms of performance, it is faster than the serial, byte-by-byte processing:
   <img width="512" alt="image" src="https://github.com/user-attachments/assets/1366c11a-0a03-429d-b641-26cb9d3241f5" />
   
   The execution speed has improved significantly.
   
   
   Note that this code does not actually use SIMD instruction sets such as AVX2, so it is not really SIMD processing yet. The main reason is that UTF-8 is a variable-length encoding: each character takes 1 to 4 bytes, and the lead byte has to be inspected to know how many bytes follow. Without a uniform length, it is hard to split the input into fixed-width lanes that SIMD operations can process directly, so the UTF-8 to UTF-16 conversion still has to analyze multi-byte characters byte by byte.
   This PR also includes some code style changes to make the code more uniform.
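   
   To illustrate the fixed-width versus variable-width point, the sketch below handles the one case that does have a uniform length: runs of ASCII bytes. It checks 8 bytes per step with a single 64-bit mask and no intrinsics. The names `asciiPrefixLength` and `widenAscii` are hypothetical and this is only an assumption about one possible approach, not a description of what this PR implements.
   
   ```cpp
   #include <cstddef>
   #include <cstdint>
   #include <cstring>
   #include <string>

   // Hypothetical helper (not from the PR): length of the leading ASCII-only
   // prefix. A byte is ASCII when its high bit is clear, so one AND against
   // 0x80 repeated 8 times tests 8 bytes at once.
   std::size_t asciiPrefixLength(const std::string &s) {
     const std::size_t n = s.size();
     std::size_t i = 0;
     while (i + 8 <= n) {
       std::uint64_t word;
       std::memcpy(&word, s.data() + i, sizeof(word)); // safe unaligned load
       if (word & 0x8080808080808080ULL) {
         break; // at least one non-ASCII byte in this block
       }
       i += 8;
     }
     // Finish the tail (or locate the exact non-ASCII byte) one byte at a time.
     while (i < n && static_cast<std::uint8_t>(s[i]) < 0x80) {
       ++i;
     }
     return i;
   }

   // Hypothetical helper: widens the first `len` bytes, assumed to be ASCII,
   // to UTF-16. Every input byte maps to exactly one code unit, so this loop
   // has no data-dependent branching.
   std::u16string widenAscii(const std::string &s, std::size_t len) {
     std::u16string out(len, u'\0');
     for (std::size_t i = 0; i < len; ++i) {
       out[i] = static_cast<char16_t>(static_cast<unsigned char>(s[i]));
     }
     return out;
   }
   ```
   
   A converter could use such a fast path for the ASCII prefix and fall back to byte-by-byte decoding as soon as a multi-byte character appears.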
   
   ## Related issues
   
   Close #1964 
   
   
   ## Does this PR introduce any user-facing change?
   
   
   - [ ] Does this PR introduce any public API change?
   - [ ] Does this PR introduce any binary protocol compatibility change?
   
   ## Benchmark
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

