[ http://issues.apache.org/jira/browse/JCR-281?page=comments#action_12358893 ]
Marcel Reutegger commented on JCR-281:
--------------------------------------

Martin, I quickly checked the web and there are some alternatives that you might want to consider for parsing HTML:
- javax.swing.text.html.parser.Parser (part of the 1.4 JDK)
- http://www.apache.org/~andyc/neko/doc/html/ (Apache license)

> textfilters module patch: Support for text extraction for HTML, XML and RTF files
> ---------------------------------------------------------------------------------
>
>          Key: JCR-281
>          URL: http://issues.apache.org/jira/browse/JCR-281
>      Project: Jackrabbit
>         Type: Improvement
>   Components: query
>     Reporter: Martin Perez
>  Attachments: patch.diff
>
> This patch adds text extraction support for XML, RTF and HTML files.
> The only dependency is the htmlparser library for handling HTML text extraction.
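
To illustrate the first alternative named in the comment, here is a minimal sketch of plain-text extraction with the Swing HTML parser that ships with the JDK. It is not taken from patch.diff or from the textfilters module. The class name HtmlTextExtractor and the method extractText are hypothetical, and the sketch drives the parser through javax.swing.text.html.parser.ParserDelegator, the usual public entry point, rather than using the Parser class directly.

```java
import java.io.IOException;
import java.io.Reader;

import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Jackrabbit: collects the text nodes
// reported by the JDK's Swing HTML parser into a single string.
class HtmlTextExtractor {

    static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();

        // The parser reports each run of character data through handleText;
        // tags are reported through other callbacks, which are ignored here.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        // The third argument (ignoreCharSet = true) tells the parser not to
        // fail on a charset declared inside the document, since the caller
        // already supplies decoded characters through the Reader.
        new ParserDelegator().parse(html, callback, true);
        return text.toString();
    }
}
```

Calling HtmlTextExtractor.extractText(new java.io.StringReader(markup)) should return the document's text with the markup removed. Whether this parser is lenient enough for real-world HTML, compared with NekoHTML or htmlparser, would need to be checked against representative files.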