[ https://issues.apache.org/jira/browse/SOLR-7027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290547#comment-14290547 ]
Uwe Schindler edited comment on SOLR-7027 at 1/24/15 11:20 AM:
---------------------------------------------------------------

Hi,

we should think about refactoring the SolrContentHandler class more. There are major problems. The URL given in the issue description shows how it should be done: [http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html#includeAttribute(net.htmlparser.jericho.StartTag,%20net.htmlparser.jericho.Attribute)]

In fact, if you have some HTML like {{<div><b>W</b>orld <b>D</b>ata <b>S</b>ystem (WDC)</div>}}, the problem is that it is converted with too much whitespace: whitespace is inserted into the catch-all field in inappropriate places, giving "W orld D ata S ystem (WDC)". If there are attributes involved it gets even worse: {{<div><span class="abbrev">W</span>orld <span class="abbrev">D</span>ata <span class="abbrev">S</span>ystem (WDC)</div>}} produces "abbrev W orld abbrev D ata abbrev S ystem (WDC)".

The Jericho parser has a better approach because it handles the HTML tags more carefully and does not just insert whitespace between them.

The official HTML-to-text converter provided by Tika is also correct: internally it handles inline tags (like span, b) differently from block tags (div, p, ...). It only inserts a newline after block tags, but never any whitespace between inline tags. Of course it swallows attributes. But in my personal opinion, the "qualified" attributes like "abbrev", "href", ... should be collected and only inserted at the end of the corresponding block tag. Solr should not insert whitespace between inline tags.


was (Author: thetaphi):
Hi,

we should think about refactoring the SolrContentHandler class more. There are major problems. The URL given in the issue description shows how it should be done: [http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html#includeAttribute(net.htmlparser.jericho.StartTag,%20net.htmlparser.jericho.Attribute)]

In fact, if you have some HTML like {{<div><b>W</b>orld <b>D</b>ata <b>S</b>ystem (WDC)</div>}}, the problem is that it is converted with too much whitespace: whitespace is inserted into the catch-all field in inappropriate places, giving "W orld D ata S ystem (WDC)". If there are attributes involved it gets even worse: {{<div><span class="abbrev">W</span>orld <span class="abbrev">D</span>ata <span class="abbrev">S</span>ystem (WDC)</div>}} produces "abbrev W orld abbrev D ata abbrev S ystem (WDC)".

The Jericho parser has a better approach because it handles the HTML tags more carefully and does not just insert whitespace between them.

The official HTML-to-text converter provided by Tika is also correct: internally it handles inline tags (like span, b) differently from block tags (div, p, ...). It only inserts a newline after block tags, but never between inline tags. Of course it swallows attributes. But in my personal opinion, the "qualified" attributes like "abbrev", "href", ... should be collected and only inserted at the end of the corresponding block tag.

> ExtractingRequestHandler indiscriminately dumps all source HTML attributes
> into the catch-all field when captureAttr=false, but it should be more
> selective, something like only href, title, alt, etc.
> attributes
> ----------------------------------------------------------------------
>
>                 Key: SOLR-7027
>                 URL: https://issues.apache.org/jira/browse/SOLR-7027
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>            Reporter: Steve Rowe
>            Priority: Minor
>
> On line 283 in {{SolrContentHandler}}, the catch-all field gets *all* source
> HTML attribute values dumped into it:
> {code:java}
> 270:  @Override
> 271:  public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
> 272:    StringBuilder theBldr = fieldBuilders.get(localName);
> 273:    if (theBldr != null) {
> 274:      //we need to switch the currentBuilder
> 275:      bldrStack.add(theBldr);
> 276:    }
> 277:    if (captureAttribs == true) {
> 278:      for (int i = 0; i < attributes.getLength(); i++) {
> 279:        addField(localName, attributes.getValue(i), null);
> 280:      }
> 281:    } else {
> 282:      for (int i = 0; i < attributes.getLength(); i++) {
> 283:        bldrStack.getLast().append(' ').append(attributes.getValue(i));
> 284:      }
> 285:    }
> 286:    bldrStack.getLast().append(' ');
> 287:  }
> {code}
> But this will contain lots of unwanted cruft: {{class}} and {{style}}
> attribute values, etc.
> It would be much better if only attribute values containing addresses or
> tooltip text, etc. were dumped into the catch-all field. Here are a couple
> of places where these kinds of attributes are described:
> http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html#includeAttribute(net.htmlparser.jericho.StartTag,%20net.htmlparser.jericho.Attribute)
> From Tika's {{HtmlHandler}} class:
> {code:java}
> // List of attributes that need to be resolved.
> private static final Set<String> URI_ATTRIBUTES =
>     new HashSet<String>(Arrays.asList("src", "href", "longdesc", "cite"));
> {code}
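
For illustration, the behaviour proposed in the comment could be sketched with a plain JDK SAX handler. This is only a minimal sketch under stated assumptions, not the Solr or Tika implementation: the class name InlineAwareTextHandler, the BLOCK_TAGS and KEPT_ATTRIBUTES lists and the extract helper are all hypothetical, and the input is assumed to be well-formed XHTML. It inserts no whitespace around inline tags, keeps only a few allow-listed attribute values, and appends them when the next block tag closes (a simplification when blocks are nested).

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler for illustration; not part of Solr or Tika. */
class InlineAwareTextHandler extends DefaultHandler {
  // Assumed lists, chosen for this example only.
  private static final Set<String> BLOCK_TAGS =
      new HashSet<>(Arrays.asList("div", "p", "li", "td", "h1", "h2", "h3"));
  private static final Set<String> KEPT_ATTRIBUTES =
      new HashSet<>(Arrays.asList("href", "title", "alt"));

  private final StringBuilder text = new StringBuilder();
  // Allow-listed attribute values seen since the last block boundary.
  private final List<String> pendingAttrs = new ArrayList<>();

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts) {
    for (int i = 0; i < atts.getLength(); i++) {
      if (KEPT_ATTRIBUTES.contains(atts.getQName(i))) {
        pendingAttrs.add(atts.getValue(i));
      }
    }
    // Deliberately no whitespace here: inline tags must not split words.
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    text.append(ch, start, length);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if (BLOCK_TAGS.contains(qName)) {
      // Emit collected attribute values at the end of the block, then a newline.
      for (String value : pendingAttrs) {
        text.append(' ').append(value);
      }
      pendingAttrs.clear();
      text.append('\n');
    }
  }

  /** Parses well-formed XHTML and returns the extracted text. */
  static String extract(String xhtml) {
    InlineAwareTextHandler handler = new InlineAwareTextHandler();
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xhtml)), handler);
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalArgumentException("input is not well-formed XHTML", e);
    }
    return handler.text.toString();
  }

  public static void main(String[] args) {
    // Prints: World Data System (WDC), see home http://example.org/wds
    System.out.print(extract(
        "<div><span class=\"abbrev\">W</span>orld <b>D</b>ata <b>S</b>ystem (WDC), "
            + "see <a href=\"http://example.org/wds\">home</a></div>"));
  }
}
```

With this approach the {{class}} value "abbrev" is dropped because it is not on the assumed allow-list, while the {{href}} value is kept but moved to the end of the enclosing block, so the running text stays intact.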