https://bugs.kde.org/show_bug.cgi?id=517639
Bug ID: 517639
Summary: OCR for Chinese text inserts unnecessary spaces between characters
Classification: Applications
Product: Spectacle
Version First Reported In: unspecified
Platform: Arch Linux
OS: Linux
Status: REPORTED
Severity: normal
Priority: NOR
Component: General
Assignee: [email protected]
Reporter: [email protected]
CC: [email protected]
Target Milestone: ---
When using Spectacle's OCR function to recognize Simplified Chinese text, it
automatically adds unnecessary spaces between each Chinese character.
Example:
Expected: 这是中文测试
Actual: 这 是 中 文 测 试
This happens because Spectacle uses Tesseract but does not post-process its
output to remove the spaces inserted between characters in CJK languages. The
OCR result is hard to read for Chinese users.
Steps to reproduce:
1. Take a screenshot containing Chinese text
2. Use Extract Text (OCR) in Spectacle
3. Paste the result; the Chinese characters have spaces between them
OBSERVED RESULT
这 是 中 文 测 试
EXPECTED RESULT
这是中文测试
SOFTWARE/OS VERSIONS
KDE Plasma Version: 6.6.2
KDE Frameworks Version: 6.23.0
Qt Version: 6.10.2
Spectacle: 6.6.2
Tesseract: 5.5.2
Distribution: Arch Linux
Suggestion:
Add post-processing to automatically remove spaces between CJK characters after
OCR.
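
A minimal sketch of the suggested post-processing, written in Python for
brevity rather than as Spectacle code. The function name strip_cjk_spaces and
the chosen Unicode ranges are assumptions made for illustration. Hangul is
deliberately left out because Korean text uses spaces between words. Only
spaces and tabs that sit directly between two CJK characters are removed, so
spacing in Latin text and at CJK/Latin boundaries is kept.

```python
import re

# Assumed set of "CJK" ranges for this sketch; a real fix may need more
# (e.g. ideographs outside the Basic Multilingual Plane).
_CJK = (
    "\u3001-\u303f"  # CJK Symbols and Punctuation (without U+3000 space)
    "\u3040-\u30ff"  # Hiragana and Katakana
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\uf900-\ufaff"  # CJK Compatibility Ideographs
)

# Match runs of spaces/tabs that have a CJK character on both sides.
# Lookarounds leave the neighbouring characters untouched, so chains such
# as "这 是 中" are handled in a single pass.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])")


def strip_cjk_spaces(text: str) -> str:
    """Remove spaces and tabs that sit directly between two CJK characters."""
    return _SPACE_BETWEEN_CJK.sub("", text)


if __name__ == "__main__":
    print(strip_cjk_spaces("这 是 中 文 测 试"))
    print(strip_cjk_spaces("Hello world 这 是 test"))
```

With the example from this report, the input "这 是 中 文 测 试" becomes
"这是中文测试", while "Hello world" keeps its space. Newlines are not touched,
so line breaks in the recognized text are preserved.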