On Wed, 2020-01-29 at 13:52 +0100, JWein wrote:
> > You need (1) feature extraction, finding the writing, (2) OCR of
> > some sort, to turn pictures of letters into letters, and then (3)
> > the linguistic analysis.
>
> Hey Liam:
>
> Thank you, and yes, I could guess the way to go would be through the
> steps you outline, but I am pretty sure some other gimp developers
> have trodden those paths before and may have some tips to share.

I doubt it. There _are_ some people who use GIMP to clean up images
preparatory to running OCR on them, or have been in the past, but there
are much better programs for that.

I asked you about text cleansing (cleaning) because it has different
meanings in different contexts; i'm *certainly* not interested in
losing the page apparatus or hyphenation information, although in my
own work i mark them so software can skip them when wanted. If you're
doing an academic study of a book “manifestation” such things are
important, but i had rather use the Text Encoding Initiative as a model
than Michael Hart’s flailing Gutenberg project.

> I do the same kinds of things you do

I doubt that, at least from your description, but some of it may be a
language issue in reading the tone of your message. If you are doing
natural language processing and semantic-Web-style text mining your
needs for texts overlap with my personal projects but not so much with
GIMP, which is a bitmap image editor.

For example, detecting Greek words and phrases included in a
30,000-page OCR'd text by analyzing the page images would interest me
(and detecting italics for that matter); if i ever have a spare few
days i plan to try the (then) latest Tesseract for that.

--
Liam Quin - web slave for https://www.fromoldbooks.org/
_______________________________________________
gimp-user-list mailing list
List address:    gimp-user-list@gnome.org
List membership: https://mail.gnome.org/mailman/listinfo/gimp-user-list
List archives:   https://mail.gnome.org/archives/gimp-user-list