[jira] [Comment Edited] (SOLR-7027) ExtractingRequestHandler indiscriminantly dumps all source HTML attributes into the catch-all field when captureAttr=false, but it should be more selective, something like only href, title, alt, etc. attributes

Uwe Schindler (JIRA) Sat, 24 Jan 2015 03:20:53 -0800

    [ https://issues.apache.org/jira/browse/SOLR-7027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290547#comment-14290547 ]


Uwe Schindler edited comment on SOLR-7027 at 1/24/15 11:20 AM:
---------------------------------------------------------------

Hi, we should think about refactoring the SolrContentHandler class more. There 
are major problems. The above URL shows how it should be done: 
[http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html#includeAttribute(net.htmlparser.jericho.StartTag,%20net.htmlparser.jericho.Attribute)]

In fact, if you have some HTML like {{<div><b>W</b>orld <b>D</b>ata 
<b>S</b>ystem (WDC)</div>}}, the problem is that it is converted with too much 
whitespace: whitespace is inserted into the catch-all field in an 
inappropriate way: "W orld D ata S ystem (WDC)". If there are attributes 
involved it gets even worse: {{<div><span class="abbrev">W</span>orld <span 
class="abbrev">D</span>ata <span class="abbrev">S</span>ystem (WDC)</div>}} 
produces "abbrev W orld abbrev D ata abbrev S ystem (WDC)". The Jericho parser 
has a better approach here because it handles the HTML tags more carefully and 
does not just insert whitespace between them.
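
To make the whitespace point concrete, here is a minimal sketch of a SAX 
handler that adds no separator around inline tags and only breaks the line at 
the end of a block-level tag. The class name BlockAwareText and its short 
BLOCK_TAGS list are assumptions made for illustration; this is not the code of 
Solr, Tika, or Jericho, and it only accepts well-formed XHTML.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: BlockAwareText is a hypothetical name, not a Solr or Tika class.
class BlockAwareText extends DefaultHandler {
  // Assumed, deliberately short list of block-level tags.
  private static final Set<String> BLOCK_TAGS =
      new HashSet<String>(Arrays.asList("div", "p", "li", "td", "h1", "h2", "h3"));

  private final StringBuilder text = new StringBuilder();

  @Override
  public void characters(char[] ch, int start, int length) {
    // Character data is copied verbatim; nothing is added around inline tags.
    text.append(ch, start, length);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    // Only the end of a block-level tag produces a line break.
    if (BLOCK_TAGS.contains(qName)) {
      text.append('\n');
    }
  }

  // Parses well-formed XHTML and returns the extracted text.
  static String extract(String xhtml) {
    BlockAwareText handler = new BlockAwareText();
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xhtml)), handler);
    } catch (Exception e) {
      throw new IllegalStateException("could not parse input", e);
    }
    return handler.text.toString();
  }
}
```

Calling BlockAwareText.extract on the first {{<div>}} example above returns 
"World Data System (WDC)" followed by a single newline, and the class 
attributes of the second example are simply ignored.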

The official HTML-to-text converter provided by Tika is also correct: it 
internally handles inline tags (like span, b) differently from block tags (div, 
p, ...). It only inserts a newline after block tags, but never any whitespace 
between inline tags. Of course it swallows attributes. But in my personal 
opinion, the "qualified" attributes like "abbrev", "href", ... should be 
collected and only inserted at the end of the corresponding block tag. Solr 
should not insert whitespace between inline tags.
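
The idea of collecting attributes and emitting them at the end of the block 
could look roughly like the sketch below. Everything in it is an assumption 
for illustration: the class name DeferredAttributeText, the BLOCK_TAGS list, 
and the KEPT_ATTRIBUTES whitelist (href, title, alt, taken from the issue 
title). It is not a patch for SolrContentHandler and it expects well-formed 
XHTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: a hypothetical handler, not part of Solr or Tika.
class DeferredAttributeText extends DefaultHandler {
  // Assumed block-level tags and assumed whitelist of attribute names.
  private static final Set<String> BLOCK_TAGS =
      new HashSet<String>(Arrays.asList("div", "p", "li", "td"));
  private static final Set<String> KEPT_ATTRIBUTES =
      new HashSet<String>(Arrays.asList("href", "title", "alt"));

  private final StringBuilder text = new StringBuilder();
  // Whitelisted attribute values seen since the last block end.
  private final List<String> pending = new ArrayList<String>();

  @Override
  public void startElement(String uri, String localName, String qName,
                           Attributes attributes) {
    // Remember whitelisted values; do not write anything into the text yet.
    for (int i = 0; i < attributes.getLength(); i++) {
      if (KEPT_ATTRIBUTES.contains(attributes.getQName(i))) {
        pending.add(attributes.getValue(i));
      }
    }
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    text.append(ch, start, length);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if (!BLOCK_TAGS.contains(qName)) {
      return; // inline tag: no whitespace, no attribute output
    }
    // Flush the collected values after the block's text, then break the line.
    for (String value : pending) {
      text.append(' ').append(value);
    }
    pending.clear();
    text.append('\n');
  }

  // Parses well-formed XHTML and returns text plus deferred attribute values.
  static String extract(String xhtml) {
    DeferredAttributeText handler = new DeferredAttributeText();
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xhtml)), handler);
    } catch (Exception e) {
      throw new IllegalStateException("could not parse input", e);
    }
    return handler.text.toString();
  }
}
```

For an input like {{<p>Visit the <a href="http://example.org/" 
class="ext">W<b>D</b>S</a> site</p>}} this keeps "WDS" intact, drops the 
{{class}} value, and appends the {{href}} value only after the text of the 
paragraph. With nested block tags the values are flushed at the innermost 
block end, which is a simplification of this sketch.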


> ExtractingRequestHandler indiscriminantly dumps all source HTML attributes 
> into the catch-all field when captureAttr=false, but it should be more 
> selective, something like only href, title, alt, etc. attributes
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7027
>                 URL: https://issues.apache.org/jira/browse/SOLR-7027
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>            Reporter: Steve Rowe
>            Priority: Minor
>
> On line 283 in {{SolrContentHandler}}, the catch-all field gets *all* source 
> HTML attribute values dumped into it:
> {code:java}
> 270:  @Override
> 271:  public void startElement(String uri, String localName, String qName, 
> Attributes attributes) throws SAXException {
> 272:    StringBuilder theBldr = fieldBuilders.get(localName);
> 273:    if (theBldr != null) {
> 274:      //we need to switch the currentBuilder
> 275:      bldrStack.add(theBldr);
> 276:    }
> 277:    if (captureAttribs == true) {
> 278:      for (int i = 0; i < attributes.getLength(); i++) {
> 279:        addField(localName, attributes.getValue(i), null);
> 280:      }
> 281:    } else {
> 282:      for (int i = 0; i < attributes.getLength(); i++) {
> 283:        bldrStack.getLast().append(' ').append(attributes.getValue(i));
> 284:      }
> 285:    }
> 286:    bldrStack.getLast().append(' ');
> 287:  }
> {code}
> But this will contain lots of unwanted cruft: {{class}} and {{style}} 
> attribute values, etc.
> It would be much better if only attribute values containing addresses or 
> tooltip text, etc. were dumped into the catch-all field.  Here are a couple 
> of places where this kind of attribute is described:
> http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html#includeAttribute(net.htmlparser.jericho.StartTag,%20net.htmlparser.jericho.Attribute)
> From Tika's {{HtmlHandler}} class:
> {code:java}
>     // List of attributes that need to be resolved.
>     private static final Set<String> URI_ATTRIBUTES =
>         new HashSet<String>(Arrays.asList("src", "href", "longdesc", "cite"));
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

