[ 
https://issues.apache.org/jira/browse/CLEREZZA-182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915021#action_12915021
 ] 

Davide Palmisano commented on CLEREZZA-182:
-------------------------------------------

Dear Tommaso,

In the attached patch [1] (taken from 
/trunk/org.apache.clerezza.parent/org.apache.clerezza.uima/org.apache.clerezza.uima.metadata-generator)
you can find an attempt to integrate Apache Tika 0.7 by implementing the 
MediaTypeTextExtractor interface. My changes include:

1) the Tika dependency added to the pom.xml
2) two tests (one for my implementation, TikaTextExtractor, and one for your 
PlainTextExtractor class)
3) some added Javadoc on the MediaTypeTextExtractor interface
4) a couple of new constructors for UnsupportedMediaTypeException

Let me know if it fits your needs.

Davide

[1] CLEREZZA-182.patch

> Integrate Apache Tika inside Apache Clerezza
> --------------------------------------------
>
>                 Key: CLEREZZA-182
>                 URL: https://issues.apache.org/jira/browse/CLEREZZA-182
>             Project: Clerezza
>          Issue Type: New Feature
>            Reporter: Tommaso Teofili
>         Attachments: CLEREZZA-182.patch
>
>
> Apache Tika is a toolkit for detecting and extracting metadata and structured 
> text content from various documents using existing parser libraries. It 
> would be nice to have it integrated inside Apache Clerezza so that Resources 
> could be easily enriched and auto-tagged with metadata once inside Clerezza.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
