[ http://issues.apache.org/jira/browse/JCR-281?page=comments#action_12358892 ]

Michael Wechner commented on JCR-281:
-------------------------------------

You might want to reuse the HTML parser from Lenya:

http://svn.apache.org/repos/asf/lenya/branches/BRANCH_1_2_X/src/java/org/apache/lenya/lucene/html/

or

http://svn.apache.org/repos/asf/lenya/branches/BRANCH_1_2_X/src/java/org/apache/lenya/lucene/parser/

which is under the Apache License, but might need some improvement.

> textfilters module patch: Support for text extraction for HTML,XML and RTF 
> files
> --------------------------------------------------------------------------------
>
>          Key: JCR-281
>          URL: http://issues.apache.org/jira/browse/JCR-281
>      Project: Jackrabbit
>         Type: Improvement
>   Components: query
>     Reporter: Martin Perez
>  Attachments: patch.diff
>
> This patch adds text extraction support for XML, RTF and HTML files.
> The only dependency is the htmlparser library, which handles HTML text extraction.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
