[Spectacle] [Bug 517639] OCR for Chinese text inserts unnecessary spaces between characters

Noah Davis Mon, 16 Mar 2026 08:40:07 -0700

https://bugs.kde.org/show_bug.cgi?id=517639


Noah Davis <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REPORTED                    |NEEDSINFO
         Resolution|---                         |WAITINGFORINFO

--- Comment #1 from Noah Davis <[email protected]> ---
> This is because Spectacle uses Tesseract but does not perform post-processing 
> to remove character spaces for CJK languages. The OCR result is unreadable 
> for Chinese users.

Is it common practice to post-process Tesseract's CJK output? I'm a bit wary of
doing our own processing of the output separately from Tesseract's own
options. I am not an expert on the various Chinese, Japanese and Korean
scripts. While I'm sure you have far more experience with Chinese scripts than
I do, it would be nice to follow some kind of standard instead of just doing
our own thing. One could also argue on a technical level that improving
Tesseract itself is the correct solution, but I don't know how difficult that
would be.
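
For reference, the kind of post-processing the report asks for could be
sketched roughly as follows. This is only an illustration in Python, not
Spectacle code and not an established standard: the function name
strip_cjk_gaps and the chosen character ranges are assumptions made for the
example. Hangul is deliberately left out, since Korean separates words with
spaces and removing them would damage the text.

```python
import re

# Assumed ranges for this sketch: CJK punctuation, Hiragana, Katakana,
# CJK Unified Ideographs (plus Extension A), and fullwidth forms.
# Hangul is excluded on purpose because Korean uses inter-word spaces.
_CJK = r"\u3001-\u303f\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef"

# Match spaces or tabs only when both neighbours are CJK characters.
# The lookarounds are zero-width, so consecutive gaps are all matched.
_GAP = re.compile(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])")


def strip_cjk_gaps(text: str) -> str:
    """Remove spaces between adjacent CJK characters; leave others alone."""
    return _GAP.sub("", text)


if __name__ == "__main__":
    print(strip_cjk_gaps("你 好 世 界"))
    print(strip_cjk_gaps("你 好 KDE Plasma"))
```

Spaces next to Latin text are kept, so mixed-script lines stay readable.
Newlines are not touched, so line structure from the OCR result is preserved.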

-- 
You are receiving this mail because:
You are watching all bug changes.
