> You need (1) feature extraction, finding the writing, (2) OCR of some
> sort, to turn pictures of letters into letters, and then (3) the
> linguistic analysis.

Hey Liam:

Thank you, and yes, I had guessed that the way to go would be through the steps you outline, but I am pretty sure some other GIMP developers have trodden those paths before and may have some tips to share.

> However, many images contain metadata in plain text (OK, XML or
> whatever) that may include language and location information.

Most of the texts I work on are image-based PDF files, that is, documents that were scanned as images.

> I'm interested in the text cleansing, can you tell me more (off list
> maybe?)

"Text cleansing" means removing all the distracting visual clutter and the ephemeral commercial nonsense from pages. Some people also call it "text normalization", although to most people normalization is just one phase of cleansing: for example, making sure that the text is normalized in the java.text.Normalizer.Form sense.

https://www.google.com/search?q="text+cleansing"

For example, gutenberg.org has made the effort to turn lots of books into plain text, but their files include a nonsensical header and footer and use hard line breaks (something that was necessary back when people used mainframes whose displays were 80 characters wide, ...).

This kind of nonsense has become the new normal. I work as a teacher and I see it as abusive, especially when it is done to students and to people who are just trying to get something done. Companies internally block certain sites, types of content, pages and sections of pages; it is about time that people started doing the same, more aggressively, on their own. Some other people will tell you about "user agreements", "morality" and about "capitalism going down if people start doing that more aggressively" ;-)

I do the same kinds of things you do, but these days I am more interested in texts, especially if they relate to education. One of my research efforts involves a corpus of the Regents exams (going back to the 1860s). They contain plenty of pictures embedded in the text, zero annotations, and frequent language switches within the texts . . .
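
To make the Gutenberg example above a bit more concrete, here is a rough Python sketch of the kind of cleansing I mean: drop the header and footer, normalize the Unicode, and undo the hard line breaks. It is only an illustration under a few assumptions. The function name `cleanse` is made up for this message, the `*** START OF ...` / `*** END OF ...` marker patterns match many Project Gutenberg files but certainly not all of them, and `unicodedata.normalize` is used here as the Python counterpart of `java.text.Normalizer`.

```python
import re
import unicodedata

# Assumption: the book text sits between "*** START OF ..." and
# "*** END OF ..." marker lines. Many Project Gutenberg files follow
# this pattern, but older ones may use different wording.
START_RE = re.compile(r"^\*\*\* ?START OF .*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF .*$", re.MULTILINE)


def cleanse(raw: str) -> str:
    """Strip header/footer, normalize Unicode, unwrap hard line breaks."""
    text = raw.replace("\r\n", "\n")

    # Drop everything up to and including the start marker, if present.
    start = START_RE.search(text)
    if start:
        text = text[start.end():]

    # Drop the end marker and everything after it, if present.
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]

    # Compose characters into canonical form (NFC), e.g. "e" + combining
    # acute accent becomes a single precomposed character.
    text = unicodedata.normalize("NFC", text)

    # Paragraphs are separated by blank lines; inside a paragraph the
    # line breaks are only wrapping, so join the lines with single spaces.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

You would call it as `clean = cleanse(open("book.txt", encoding="utf-8").read())`, where `book.txt` is a placeholder file name. Note that joining every line inside a paragraph also flattens verse, tables and anything else where the line breaks carry meaning, so real texts would need extra rules for those cases.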

--
JWein (via www.gimpusers.com/forums)
_______________________________________________
gimp-user-list mailing list
List address: gimp-user-list@gnome.org
List membership: https://mail.gnome.org/mailman/listinfo/gimp-user-list
List archives: https://mail.gnome.org/archives/gimp-user-list