Anne Blankert
Tue, 15 Dec 2009 13:31:54 -0800
Changed file: apache/tika/parser/html/HtmlParser.java
Method: protected String mapSafeElement(String name)
Added line: if ("DIV".equals(name)) return "div";
Could this change be applied to the tika source?
Subclassing HtmlParser does not seem to be an easy alternative solution,
because it requires changing the default TikaConfig.
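
For reference, once the parser passes <div> through, the filtering handler could look roughly like the sketch below. It uses only the JDK's SAX classes. The class name DivTextHandler and its constructor argument are hypothetical and for illustration only; this is not the handler from my application, and it assumes the parser reports the element name as "div" with an "id" attribute.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: collects the text found inside <div id="wantedId">. */
class DivTextHandler extends DefaultHandler {
    private final String wantedId;                        // id value marking the interesting div
    private final StringBuilder text = new StringBuilder();
    private int depth = 0;                                // 0 means "outside the wanted div"

    DivTextHandler(String wantedId) {
        this.wantedId = wantedId;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0) {
            depth++;                                      // nested element inside the wanted div
        } else if ("div".equalsIgnoreCase(elementName(localName, qName))
                && wantedId.equals(atts.getValue("id"))) {
            depth = 1;                                    // entering the wanted div
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0) {
            depth--;                                      // reaches 0 when the wanted div closes
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            text.append(ch, start, length);               // keep only text inside the wanted div
        }
    }

    String getText() {
        return text.toString();
    }

    // Parsers that are not namespace-aware may leave localName empty, so fall back to qName.
    private static String elementName(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }
}
```

Such a handler would be passed to p.parse(stream, handler, metadata) in place of the body content handler, and getText() read afterwards. It can only work if startElement() is actually called for <div>, which is what the one-line change above is meant to achieve.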
On 2009-12-03 18:42, Anne Blankert wrote:
Hello list,

This question is about how to get the content of <div id="article">..interesting content...</div>

Is the <div> element skipped on purpose, or is there a way to tell the parser what to pass through and what not?

I am using Tika to extract plain text from documents behind RSS feeds. Many of these documents are HTML, and most of the HTML pages are based on templates. The template content is repeated on every page and does not contain useful information. I am only interested in the added content, not in the templates themselves. I found that almost all such HTML pages mark the start and end of the interesting part, something like this: <div id="article">....</div> or <div class="news">....</div> etc.

I wrote an extended ContentHandler to filter these marked parts from the HTML. I figured that if I overrode the methods "DefaultHandler.startElement()" and "DefaultHandler.endElement()", I would be able to extract the contents of these <div> elements. But I was wrong: with my sample HTML files, the Tika parser only seems to pass the elements <html><head><title><body><p><a><ol><li><ul><table><tr><td><tbody> through to the ContentHandler. ContentHandler.startElement() is not called for the <div> element.

I am using the Tika parser like this:
<code>
URL itemURL = new URL(itemLink);
DataInputStream daHTMLfromDaItem = new DataInputStream(itemURL.openStream());
ContentHandler bodyContentHandler = new MyExtendedBodyContentHandler(htmlTag, htmlTagAttribute, htmlTagAttributeValue);
Metadata metadata = new Metadata();
AutoDetectParser p = new AutoDetectParser();
try {
    // get HTML and convert to text in bodyContentHandler
    p.parse(daHTMLfromDaItem, bodyContentHandler, metadata);
    ...
</code>

Is there a way to tell the parser to call the handler's startElement() and endElement() methods for elements like <div>? Or should I use another method to get the content of these <div> elements?

Thanks,
Anne Blankert
--
Drs. Anne Blankert
Geodan Systems & Research
President Kennedylaan 1
1079 MB Amsterdam (NL)
Tel: +31 (0)20 - 5711 311
Fax: +31 (0)20 - 5711 333
E-mail: anne.blank...@geodan.nl
Website: www.geodan.nl
Disclaimer: www.geodan.nl/disclaimer