tika-user  

Re: How to customize parsing html, retrieve <div> content?

Jukka Zitting
Tue, 15 Dec 2009 17:54:54 -0800

Hi,

On Tue, Dec 15, 2009 at 10:31 PM, Anne Blankert <anne.blank...@geodan.nl> wrote:
> The following solves the quiet omission of <div> elements by the tika html
> parser:
>
> changed file apache/tika/parser/html/HtmlParser.java
> method
>   protected String mapSafeElement(String name)
> added line
>   if ("DIV".equals(name)) return "div";
>
> Could this change be applied to the tika source?

I'm not too excited about this change, as it would be good to keep the
Tika output as simple as possible by default. <div> elements carry no
inherent semantic meaning, so for a generic client (i.e. one without
domain-specific knowledge) they'd just be an unnecessary distraction.

However, I can see how a client that knows more about the expected
document structure might want Tika to pass such information through.
See TIKA-347 for the currently recommended way to do this.

> Subclassing HtmlParser does not seem to be an easy alternative solution,
> because it requires changing the default TikaConfig.

See the TIKA-347 changes that I've just committed to the Tika trunk
and that will be included in the upcoming Tika 0.6 release. With these
changes it's possible to pass customized HTML mapping rules through
the parse context mechanism that was introduced in Tika 0.5. For
example, you could do this:

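    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.html.DefaultHtmlMapper;
    import org.apache.tika.parser.html.HtmlMapper;

    // Custom mapper that lets <div> elements through to the output.
    class MyHtmlMapper extends DefaultHtmlMapper {
        @Override
        public String mapSafeElement(String name) {
            if ("DIV".equals(name)) return "div";
            return super.mapSafeElement(name);
        }
    }

    Parser parser = ...;
    ParseContext context = new ParseContext();
    // Register the custom mapper under the HtmlMapper key.
    context.set(HtmlMapper.class, new MyHtmlMapper());
    parser.parse(..., context);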
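
To illustrate why setting one entry in the context is enough to change
the mapping, here is a rough, dependency-free sketch of the lookup
pattern. The names ElementMapper, BaseElementMapper, DivElementMapper,
MapperContext and MapperContextSketch are made-up stand-ins for
illustration only. They are not Tika classes, the list of "safe"
elements is a tiny invented subset, and the real parser internals may
well differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a mapper interface; NOT a real Tika type.
interface ElementMapper {
    // Returns the output element name, or null if the element is dropped.
    String mapSafeElement(String name);
}

// Hypothetical default mapper with a tiny, invented set of safe elements.
class BaseElementMapper implements ElementMapper {
    public String mapSafeElement(String name) {
        if ("P".equals(name)) return "p";
        if ("H1".equals(name)) return "h1";
        return null; // everything else is silently omitted
    }
}

// Custom mapper: lets DIV through and delegates the rest to the default.
class DivElementMapper extends BaseElementMapper {
    @Override
    public String mapSafeElement(String name) {
        if ("DIV".equals(name)) return "div";
        return super.mapSafeElement(name);
    }
}

// Hypothetical type-keyed context: one optional value per class key.
class MapperContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value != null ? key.cast(value) : defaultValue;
    }
}

class MapperContextSketch {
    // Uses the mapper found in the context, else falls back to the default.
    static String map(MapperContext context, String name) {
        ElementMapper mapper =
                context.get(ElementMapper.class, new BaseElementMapper());
        return mapper.mapSafeElement(name);
    }

    public static void main(String[] args) {
        MapperContext context = new MapperContext();
        System.out.println(map(context, "DIV")); // prints "null"

        context.set(ElementMapper.class, new DivElementMapper());
        System.out.println(map(context, "DIV")); // prints "div"
        System.out.println(map(context, "P"));   // prints "p"
    }
}
```

The point of the pattern is that a client without a custom mapper
gets the simple default output, while a client that registers one
changes the mapping for its own parse calls only.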

BR,

Jukka Zitting