Re: [Gimp-user] approches used for language detection on images ...

Liam R E Quin Wed, 29 Jan 2020 12:07:24 -0800

On Wed, 2020-01-29 at 13:52 +0100, JWein wrote:
> > You need (1) feature extraction, finding the writing, (2) OCR of
> > some sort, to turn pictures of letters into letters, and then (3)
> > the linguistic analysis.
> 
>  Hey Liam:
> 
> Thank you, and yes, I could guess the way to go would be through the
> steps you outline, but I am pretty sure some other gimp developers
> have trodden those paths before and may have some tips to share.


I doubt it.

There _are_ some people who use GIMP to clean up images preparatory to
running OCR on them, or have been in the past, but there are much
better programs for that.

I asked you about text cleansing (cleaning) because it has different
meanings in different contexts; i'm *certainly* not interested in
losing the page apparatus or hyphenation information, although in my
own work i mark them so software can skip them when wanted.

If you're doing an academic study of a book “manifestation” such things
are important, but i had rather use the Text Encoding Initiative as a
model than Michael Hart’s flailing Gutenberg project.

> I do the same kinds of things you do 

I doubt that, at least from your description, but some of it may be a
language issue in reading the tone of your message. If you are doing
natural language processing and semantic-Web-style text mining your
needs for texts overlap with my personal projects but not so much with
GIMP, which is a bitmap image editor. For example, detecting Greek
words and phrases included in a 30,000-page OCR'd text by analyzing the
page images would interest me (and detecting italics for that matter);
if i ever have a spare few days i plan to try the (then) latest
Tesseract for that.
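
As a rough illustration of the text side of that idea (not the
page-image analysis itself), here is a minimal Python sketch that flags
Greek-script words in text that has already been OCR'd. It assumes the
OCR engine emitted real Greek Unicode characters in precomposed form,
which is often not the case when the engine was run with an English-only
model. The names find_greek_words and is_greek_char are made up for this
example and do not come from Tesseract or GIMP.

```python
import re
import unicodedata

# Runs of letters only: \w minus digits and underscore.
WORD_RE = re.compile(r"[^\W\d_]+")


def is_greek_char(ch):
    """True if the Unicode character name marks ch as Greek."""
    return unicodedata.name(ch, "").startswith("GREEK")


def find_greek_words(text):
    """Return (offset, word) pairs for words that are mostly Greek letters.

    Assumption: the text uses precomposed characters, so an accented
    letter is one code point rather than a base letter plus a
    combining mark.
    """
    hits = []
    for match in WORD_RE.finditer(text):
        word = match.group()
        greek = sum(1 for ch in word if is_greek_char(ch))
        # "Mostly Greek" is an arbitrary threshold chosen for this sketch.
        if greek * 2 > len(word):
            hits.append((match.start(), word))
    return hits


if __name__ == "__main__":
    sample = "the word \u03bb\u03cc\u03b3\u03bf\u03c2 appears here"
    for offset, word in find_greek_words(sample):
        print(offset, word)
```

The offsets make it possible to map each hit back to a position in the
OCR output, which is where per-page or per-word image coordinates would
have to be looked up if the engine recorded them.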

-- 
Liam Quin - web slave for https://www.fromoldbooks.org/

_______________________________________________
gimp-user-list mailing list
List address:    gimp-user-list@gnome.org
List membership: https://mail.gnome.org/mailman/listinfo/gimp-user-list
List archives:   https://mail.gnome.org/archives/gimp-user-list
