[ https://issues.apache.org/jira/browse/CTAKES-189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pei Chen updated CTAKES-189:
----------------------------
    Fix Version/s:     (was: 3.2.0)
                   future enhancement

> GSoC: Implement OCR/Tika to standardize text input for cTAKES
> -------------------------------------------------------------
>
>                 Key: CTAKES-189
>                 URL: https://issues.apache.org/jira/browse/CTAKES-189
>             Project: cTAKES
>          Issue Type: New Feature
>    Affects Versions: 3.0-incubating
>            Reporter: Pei Chen
>              Labels: gsoc, gsoc2013
>             Fix For: future enhancement
>
>         Attachments: Gui.java
>
>
> I am proposing a component in cTAKES that can take in various types of
> content (PDF, scanned JPGs, Word, XLS, TXT, etc.) and extract the text
> content before passing it on to cTAKES for NLP processing.
> There are open source libraries such as Tika and JavaOCR that could serve
> as a starting point, but I have not found a centralized library that
> incorporates all of the above, including OCR, into the flow easily.

--
This message was sent by Atlassian JIRA
(v6.2#6252)