[ https://issues.apache.org/jira/browse/CTAKES-189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pei Chen updated CTAKES-189:
----------------------------

    Fix Version/s:     (was: 3.2.0)
                   future enhancement

> GSoC: Implement OCR/Tika to standardize text input for cTAKES
> -------------------------------------------------------------
>
>                 Key: CTAKES-189
>                 URL: https://issues.apache.org/jira/browse/CTAKES-189
>             Project: cTAKES
>          Issue Type: New Feature
>    Affects Versions: 3.0-incubating
>            Reporter: Pei Chen
>              Labels: gsoc, gsoc2013
>             Fix For: future enhancement
>
>         Attachments: Gui.java
>
>
> I propose adding a component to cTAKES that accepts various content types
> (PDF, scanned JPGs, Word, XLS, TXT, etc.) and extracts the text content
> before passing it on to cTAKES for NLP processing.
> Open source libraries such as Tika and JavaOCR are available as a starting
> point, but I have not found a single library that easily incorporates all of
> the above, including OCR, into one flow.
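
One way to picture the front end of such a component is a small router that decides, per input file, whether the content would go to a plain-text reader, a document parser (for example a Tika-style parser), or an OCR step. The sketch below is illustrative only: InputRouter, Route, and the extension table are hypothetical names and assumptions, not part of cTAKES, Tika, or JavaOCR, and the actual text extraction is deliberately left out.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: none of these names come from cTAKES, Tika, or JavaOCR.
final class InputRouter {

    // The kind of extractor a file would be handed to.
    enum Route { PLAIN_TEXT, DOCUMENT_PARSER, OCR, UNSUPPORTED }

    // Assumed extension-to-route table, for illustration only.
    private static final Map<String, Route> ROUTES = Map.of(
            "txt", Route.PLAIN_TEXT,
            "pdf", Route.DOCUMENT_PARSER,
            "doc", Route.DOCUMENT_PARSER,
            "docx", Route.DOCUMENT_PARSER,
            "xls", Route.DOCUMENT_PARSER,
            "jpg", Route.OCR,
            "jpeg", Route.OCR);

    // Picks a route from the file extension, ignoring case.
    static Route routeFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Route.UNSUPPORTED; // no extension to go on
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return ROUTES.getOrDefault(ext, Route.UNSUPPORTED);
    }
}
```

Routing on the file extension keeps the sketch short. A real component would more likely inspect the content itself, since a PDF that contains only scanned page images would also need the OCR route.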



--
This message was sent by Atlassian JIRA
(v6.2#6252)
