cpp: Add UTF-16/UTF-32 encoding auto-detection in C preprocessor

katahiromz Wed, 26 Nov 2025 16:34:57 -0800

Hello, I'm katahiromz. Thank you for your great software.
I want to add UTF-16/UTF-32 support to your C preprocessor.


The attached patch adds automatic character encoding detection
to `libcpp/files.cc` by examining the first 4 bytes of each input file.
I hope this patch helps.

---
Technical information:

**Detection logic in `read_file_guts`:**

- Binary files (all zeros in first 4 bytes) --> error
- BOM detection:
  - `FF FE 00 00` --> UTF-32LE
  - `00 00 FE FF` --> UTF-32BE
  - `FF FE` --> UTF-16LE
  - `FE FF` --> UTF-16BE
  - `EF BB BF` --> UTF-8 (handled by existing code)
- Null byte pattern inference (no BOM):
  - bytes[1]==0 && bytes[3]==0 --> UTF-16LE
  - bytes[0]==0 && bytes[2]==0 --> UTF-16BE
  - bytes[1,2,3]==0 --> UTF-32LE
  - bytes[0,1,2]==0 --> UTF-32BE
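
For readers without the attachment, the rules above can be sketched roughly as follows. This is not the code from the patch: `enc_guess` and `guess_encoding` are hypothetical names, and the ordering is my assumption. The 4-byte BOMs are tested before the 2-byte ones, and the UTF-32 null patterns before the UTF-16 ones, because in each case the longer pattern also satisfies the shorter test.

```c
#include <stddef.h>

/* Hypothetical result type; the names used in the actual patch may differ. */
enum enc_guess {
  ENC_UNKNOWN,   /* no inference; leave to existing handling (incl. UTF-8 BOM) */
  ENC_BINARY,    /* first 4 bytes are all zero --> error */
  ENC_UTF16LE,
  ENC_UTF16BE,
  ENC_UTF32LE,
  ENC_UTF32BE
};

/* Guess the encoding from the first 4 bytes of a buffer. */
static enum enc_guess
guess_encoding (const unsigned char *b, size_t len)
{
  if (len < 4)
    return ENC_UNKNOWN;               /* short files: no inference */

  if (b[0] == 0 && b[1] == 0 && b[2] == 0 && b[3] == 0)
    return ENC_BINARY;

  /* BOMs. UTF-32LE must come before UTF-16LE: both start with FF FE. */
  if (b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
    return ENC_UTF32LE;
  if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
    return ENC_UTF32BE;
  if (b[0] == 0xFF && b[1] == 0xFE)
    return ENC_UTF16LE;
  if (b[0] == 0xFE && b[1] == 0xFF)
    return ENC_UTF16BE;

  /* Null-byte patterns, no BOM. UTF-32 first: its pattern also
     satisfies the corresponding UTF-16 test. */
  if (b[1] == 0 && b[2] == 0 && b[3] == 0)
    return ENC_UTF32LE;
  if (b[0] == 0 && b[1] == 0 && b[2] == 0)
    return ENC_UTF32BE;
  if (b[1] == 0 && b[3] == 0)
    return ENC_UTF16LE;
  if (b[0] == 0 && b[2] == 0)
    return ENC_UTF16BE;

  return ENC_UNKNOWN;
}
```

With this ordering, a file beginning with `#` `00` `i` `00` is guessed as UTF-16LE, while `#` `00` `00` `00` is guessed as UTF-32LE.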

**Changes:**
- Added `detect_encoding()` function for BOM/pattern detection
- Modified `read_file_guts()` to auto-detect and strip BOM before conversion

Files shorter than 4 bytes are processed normally, without any encoding inference.
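
As an illustration of the BOM-stripping step, a helper along these lines could report how many leading bytes to skip before conversion. This is only a sketch, not the code in the patch: `bom_length` is a hypothetical name, and I am assuming here that the 4-byte minimum also applies to BOM detection and that the UTF-8 BOM stays with the existing code.

```c
#include <stddef.h>

/* Return the number of leading BOM bytes to skip for UTF-16/UTF-32
   input, or 0 if no such BOM is present.  Hypothetical helper. */
static size_t
bom_length (const unsigned char *b, size_t len)
{
  if (len < 4)
    return 0;                         /* assumed: short files are left alone */

  /* 4-byte UTF-32 BOMs first, since FF FE 00 00 also starts with FF FE. */
  if (b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
    return 4;                         /* UTF-32LE */
  if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
    return 4;                         /* UTF-32BE */

  if ((b[0] == 0xFF && b[1] == 0xFE)      /* UTF-16LE */
      || (b[0] == 0xFE && b[1] == 0xFF))  /* UTF-16BE */
    return 2;

  return 0;                           /* UTF-8 BOM is left to existing code */
}
```

The caller would then pass `buf + bom_length (buf, len)` and the correspondingly reduced length to the conversion step.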

---
Best regards,
Katayama Hirofumi MZ <[email protected]>

cpp-utf16.patch
Description: Binary data
