[ http://issues.apache.org/jira/browse/JCR-281?page=comments#action_12358892 ]
Michael Wechner commented on JCR-281:
-------------------------------------

You might want to reuse the HTML parser from Lenya:

http://svn.apache.org/repos/asf/lenya/branches/BRANCH_1_2_X/src/java/org/apache/lenya/lucene/html/

and

http://svn.apache.org/repos/asf/lenya/branches/BRANCH_1_2_X/src/java/org/apache/lenya/lucene/parser/

It is under the Apache license, but might need some improvement.

> textfilters module patch: Support for text extraction for HTML, XML and RTF files
> --------------------------------------------------------------------------------
>
>          Key: JCR-281
>          URL: http://issues.apache.org/jira/browse/JCR-281
>      Project: Jackrabbit
>         Type: Improvement
>   Components: query
>     Reporter: Martin Perez
>  Attachments: patch.diff
>
> This patch adds text extraction support for XML, RTF and HTML files.
> The only dependency is the htmlparser library, which handles HTML text extraction.
