Anne Blankert
Mon, 01 Feb 2010 01:05:41 -0800
Hello,

Changing the HTML handler through the configuration is not so easy. I think I had about the same question (see the list thread "How to customize parsing html, retrieve <div> content"). The list came up with the following solution (setting MyHtmlMapper in the ParseContext should be available as of Tika 0.6):
class MyHtmlMapper extends DefaultHtmlMapper {
    public String mapSafeElement(String name) {
        if ("DIV".equals(name)) return "div";
        return super.mapSafeElement(name);
    }
}
Parser parser = ...;
ParseContext context = new ParseContext();
context.set(HtmlMapper.class, new MyHtmlMapper());
parser.parse(..., context);
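The override-with-fallback pattern used by MyHtmlMapper can be tried without Tika on the classpath. The following is only a sketch: ElementMapper and DefaultMapper are hypothetical stand-ins, not Tika classes, and the whitelist contents are an assumption made for illustration. It assumes, as with Tika's mapper, that a null return value means the element is not kept.

```java
import java.util.Map;

public class MapperSketch {
    // Hypothetical stand-in for a mapper interface, reduced to one method.
    // A null return value means "this element is not kept".
    interface ElementMapper {
        String mapSafeElement(String name);
    }

    // Hypothetical default: a small whitelist keyed by upper-case names.
    static class DefaultMapper implements ElementMapper {
        private static final Map<String, String> SAFE =
                Map.of("P", "p", "TABLE", "table", "TR", "tr", "TD", "td");

        @Override
        public String mapSafeElement(String name) {
            return SAFE.get(name);
        }
    }

    // Same shape as MyHtmlMapper: add one element, defer to the default otherwise.
    static class DivMapper extends DefaultMapper {
        @Override
        public String mapSafeElement(String name) {
            if ("DIV".equals(name)) return "div";
            return super.mapSafeElement(name);
        }
    }

    public static void main(String[] args) {
        ElementMapper mapper = new DivMapper();
        System.out.println(mapper.mapSafeElement("DIV"));    // added by the subclass
        System.out.println(mapper.mapSafeElement("TABLE"));  // inherited from the default
        System.out.println(mapper.mapSafeElement("SCRIPT")); // null: not whitelisted
    }
}
```

The point of calling super.mapSafeElement(name) is that the subclass only has to know about the elements it adds; everything else keeps the default behaviour.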
Anne
On 2010-01-30 1:34, florent andré wrote:
So OK, I found a solution, surely not the optimal one, but I will share my experience with you.

HtmlHandler is not "extends enabled" because:
1 - all attributes are private and would have to be protected
2 - resolve() is in the same situation
3 - calling super.startElement() is not so easy because of the body/title/discard level counting.

HtmlParser is more "extends enabled", but the only reason I extend this class is to change the hardcoded new HtmlHandler in the expression

    parser.setContentHandler(new XHTMLDowngradeHandler(
            new HtmlHandler(this, handler, metadata)));

to MyHtmlHandler(...).

Maybe a configuration option for instantiating this class would be worthwhile.

Can you tell me if I am going the wrong way, and whether a possibility to "overwrite/extend" the features of the parser is on your roadmap?

My two pence... have a good day
++
Florent André wrote:

Hi all,

I work on HTML parsing via the generic AutoDetectParser() class. I have to keep some "specific" attributes (id and class) on <table> elements in order to detect which tables have "meaning" for my app.

So, as far as I understand for now, I have to:
- extend HtmlHandler with MyHtmlHandler
- in MyHtmlHandler, override public void startElement(...) with something like this:

    if (bodyLevel == 0 && discardLevel == 0) {
        if ("TABLE".equals(name)) {
            AttributesImpl attributes = new AttributesImpl();
            String id = atts.getValue("id");
            // "class" is a reserved word in Java, so use another variable name
            String cssClass = atts.getValue("class");
            if (id != null) {
                attributes.addAttribute("", "id", "id", "CDATA", id);
            }
            if (cssClass != null) {
                attributes.addAttribute("", "class", "class", "CDATA", cssClass);
            }
            xhtml.startElement("http://www.w3.org/1999/xhtml", "table", "table", attributes);
        } else {
            // any element other than table
            super.startElement(...);
        }
    } else {
        // any other bodyLevel and discardLevel
        super.startElement(...);
    }

- and finally pass MyHtmlHandler to the parse() method via the ParseContext.

* Is this the right way to do such a thing?
* How can I use the ParseContext to pass MyHtmlHandler? I can't find any example of it...

Any comment will be much appreciated.
Have a good day
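
The startElement() override described in the quoted message can be sketched with plain JDK SAX classes, independent of Tika. This is only an illustration under stated assumptions: TableAttributeFilter and its collect() helper are hypothetical names, the input is well-formed XML rather than real-world HTML, and the filter drops every attribute except id and class on table elements.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class TableAttributeFilter extends XMLFilterImpl {
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Start from an empty attribute list and copy only what should survive.
        AttributesImpl kept = new AttributesImpl();
        if ("table".equalsIgnoreCase(qName)) {
            copyIfPresent(atts, kept, "id");
            copyIfPresent(atts, kept, "class");
        }
        super.startElement(uri, localName, qName, kept);
    }

    private static void copyIfPresent(Attributes from, AttributesImpl to, String name) {
        String value = from.getValue(name);
        if (value != null) {
            to.addAttribute("", name, name, "CDATA", value);
        }
    }

    // Hypothetical helper: runs the filter and records each start tag it lets through.
    static String collect(String xml) throws Exception {
        StringBuilder seen = new StringBuilder();
        TableAttributeFilter filter = new TableAttributeFilter();
        filter.setParent(SAXParserFactory.newInstance().newSAXParser().getXMLReader());
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                seen.append(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    seen.append(' ').append(atts.getQName(i))
                        .append('=').append(atts.getValue(i));
                }
                seen.append(';');
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return seen.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<table id=\"t1\" class=\"data\" border=\"1\">"
                + "<tr align=\"left\"><td>x</td></tr></table>";
        System.out.println(collect(xml)); // prints: table id=t1 class=data;tr;td;
    }
}
```

Building a fresh AttributesImpl, instead of editing the incoming Attributes object, keeps the filter independent of how the upstream parser stores its attributes.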
--
Drs. Anne Blankert
Geodan Systems & Research
President Kennedylaan 1
1079 MB Amsterdam (NL)
-------------------------------------
Tel: +31 (0)20 - 5711 311
Fax: +31 (0)20 - 5711 333
-------------------------------------
E-mail: anne.blank...@geodan.nl
Website: www.geodan.nl
Disclaimer: www.geodan.nl/disclaimer
-------------------------------------