tika-user  

How to customize parsing html, retrieve <div> content?

Anne Blankert
Thu, 03 Dec 2009 09:43:20 -0800

Hello list,

This question is about how to get the content of <div id="article">..interesting content...</div>

Is the <div> element skipped on purpose or is there a way to tell the parser what to pass through and what not?

I am using Tika to extract plain text from documents linked from RSS feeds. Many of these documents are HTML, and most of the HTML pages are based on templates. The template content is repeated on every page and carries no useful information, so I am only interested in the added content, not the template itself. I found that almost all such pages mark the start and end of the interesting part, something like this:

<div id="article">....</div> or <div class="news">....</div> etc.

I wrote an extended ContentHandler to filter these marked parts from the HTML. I figured that if I overrode "DefaultHandler.startElement()" and "DefaultHandler.endElement()", I would be able to extract the contents of these <div> elements. But I was wrong: with my sample HTML files, the Tika parser only seems to pass the following elements through to the ContentHandler: <html><head><title><body><p><a><ol><li><ul><table><tr><td><tbody>. ContentHandler.startElement() is never called for the <div> element.
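
To make the idea concrete, here is a minimal sketch of such a filtering handler. It is illustrative only and is not the actual MyExtendedBodyContentHandler: the class name DivContentHandler, its constructor parameters, and the extract() helper are hypothetical. It is exercised here with the JDK's own SAX parser on well-formed XHTML, so it only shows the filtering logic; it would collect text under Tika only if the parser actually delivered the <div> events to the handler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the character data found inside the first-level
// element whose tag name and attribute value match the ones given.
public class DivContentHandler extends DefaultHandler {
    private final String tag;
    private final String attrName;
    private final String attrValue;
    private final StringBuilder text = new StringBuilder();
    private int depth = 0; // > 0 while we are inside the marked element

    public DivContentHandler(String tag, String attrName, String attrValue) {
        this.tag = tag;
        this.attrName = attrName;
        this.attrValue = attrValue;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the parser is not namespace-aware
        String name = localName.isEmpty() ? qName : localName;
        if (depth > 0) {
            depth++; // nested element inside the marked region
        } else if (tag.equalsIgnoreCase(name) && attrValue.equals(atts.getValue(attrName))) {
            depth = 1; // entering the marked region
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0) {
            depth--; // reaches 0 again when the marked element closes
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            text.append(ch, start, length);
        }
    }

    public String getText() {
        return text.toString();
    }

    // Convenience helper for trying the handler without Tika (assumes well-formed XHTML).
    public static String extract(String xhtml, String tag, String attrName, String attrValue)
            throws Exception {
        DivContentHandler handler = new DivContentHandler(tag, attrName, attrValue);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText().trim();
    }
}
```

For example, DivContentHandler.extract(page, "div", "id", "article") returns only the text inside <div id="article">, including text of elements nested in it, and an empty string if no such element is reported.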

I am using the tika parser like this:

<code>
// open the HTML document that the feed item links to
URL itemURL = new URL(itemLink);
InputStream daHTMLfromDaItem = itemURL.openStream();

ContentHandler bodyContentHandler = new MyExtendedBodyContentHandler(htmlTag, htmlTagAttribute, htmlTagAttributeValue);
Metadata metadata = new Metadata();
AutoDetectParser p = new AutoDetectParser();
try {
 // get HTML and convert to text in bodyContentHandler
 p.parse(daHTMLfromDaItem, bodyContentHandler, metadata);
 ...
</code>

Is there a way to tell the parser to call the handler's startElement() and endElement() methods for elements like <div>? Or should I use another method to get the content of these <div> elements?

Thanks,

Anne Blankert