[jira] [Comment Edited] (SOLR-7027) ExtractingRequestHandler indiscriminantly dumps all source HTML attributes into the catch-all field when captureAttr=false, but it should be more selective, something like only href, title, alt, etc. attributes

JIRA Tue, 20 Dec 2016 17:09:57 -0800

    [ https://issues.apache.org/jira/browse/SOLR-7027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765748#comment-15765748 ]


Ole Jørgen Brønner edited comment on SOLR-7027 at 12/21/16 1:08 AM:
--------------------------------------------------------------------

A semi-related issue that caught me off guard: it doesn't seem to be possible 
to capture both attribute values ({{captureAttr}}) and content 
({{capture=h1}}) while still being able to tell the content and the 
attributes apart.

Without {{captureAttr}}, the content captured in the {{h1}} field will be of 
very low quality, since h1 tags commonly carry attributes such as {{class}} 
whose values get appended to the captured text. With {{captureAttr}}, the 
attribute values are stored in the same field as the content (it doesn't seem 
possible to map the attributes and the content to different fields). They are 
stored as separate values in the multivalued field, but I don't think that 
helps much.

The documentation also says that when capturing elements ({{capture=h1}}) the 
content should also be present in the catch-all content field, but that doesn't 
align with my observations.
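
To make the distinction concrete, here is a minimal sketch of what "attributes and content in different fields" could look like. It is not Solr Cell code: the class {{SeparateAttrFieldSketch}}, the method {{extract}} and the {{_attr}} field suffix are all hypothetical names chosen for illustration, and it uses only the JDK's SAX parser on well-formed XHTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch, not part of Solr: keeps element text and attribute values apart. */
final class SeparateAttrFieldSketch extends DefaultHandler {
  private final String element;
  private final Map<String, List<String>> fields = new LinkedHashMap<>();
  private StringBuilder text; // non-null only while inside the captured element

  SeparateAttrFieldSketch(String element) {
    this.element = element;
  }

  @Override
  public void startElement(String uri, String localName, String qName, Attributes attributes) {
    if (!qName.equals(element)) {
      return; // child elements contribute text only, via characters()
    }
    text = new StringBuilder();
    for (int i = 0; i < attributes.getLength(); i++) {
      // Assumption: attribute values go to a sibling field named "<element>_attr".
      add(element + "_attr", attributes.getValue(i));
    }
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    if (text != null) {
      text.append(ch, start, length);
    }
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if (qName.equals(element) && text != null) {
      add(element, text.toString().trim());
      text = null;
    }
  }

  private void add(String field, String value) {
    fields.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
  }

  /** Parses well-formed XHTML and returns the captured fields in insertion order. */
  static Map<String, List<String>> extract(String xhtml, String element) {
    try {
      SeparateAttrFieldSketch handler = new SeparateAttrFieldSketch(element);
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xhtml)), handler);
      return handler.fields;
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse input", e);
    }
  }
}
```

Calling {{SeparateAttrFieldSketch.extract("<h1 class=\"big\">Hello</h1>", "h1")}} yields one value in {{h1}} (the text) and one in {{h1_attr}} (the class value), which is the separation I was hoping to be able to configure.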


> ExtractingRequestHandler indiscriminantly dumps all source HTML attributes 
> into the catch-all field when captureAttr=false, but it should be more 
> selective, something like only href, title, alt, etc. attributes
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7027
>                 URL: https://issues.apache.org/jira/browse/SOLR-7027
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 5.0
>            Reporter: Steve Rowe
>            Priority: Minor
>             Fix For: 5.2, 6.0
>
>
> On line 283 in {{SolrContentHandler}}, the catch-all field gets *all* source 
> HTML attribute values dumped into it:
> {code:java}
> 270:  @Override
> 271:  public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
> 272:    StringBuilder theBldr = fieldBuilders.get(localName);
> 273:    if (theBldr != null) {
> 274:      //we need to switch the currentBuilder
> 275:      bldrStack.add(theBldr);
> 276:    }
> 277:    if (captureAttribs == true) {
> 278:      for (int i = 0; i < attributes.getLength(); i++) {
> 279:        addField(localName, attributes.getValue(i), null);
> 280:      }
> 281:    } else {
> 282:      for (int i = 0; i < attributes.getLength(); i++) {
> 283:        bldrStack.getLast().append(' ').append(attributes.getValue(i));
> 284:      }
> 285:    }
> 286:    bldrStack.getLast().append(' ');
> 287:  }
> {code}
> But this will contain lots of unwanted cruft: the values of {{class}} and 
> {{style}} attributes, etc.
> It would be much better if only attribute values containing addresses, 
> tooltip text, etc. were dumped into the catch-all field.  Here are a couple 
> of places where these kinds of attributes are described:
> http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html#includeAttribute(net.htmlparser.jericho.StartTag,%20net.htmlparser.jericho.Attribute)
> From Tika's {{HtmlHandler}} class:
> {code:java}
>     // List of attributes that need to be resolved.
>     private static final Set<String> URI_ATTRIBUTES =
>         new HashSet<String>(Arrays.asList("src", "href", "longdesc", "cite"));
> {code}
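
The more selective behaviour requested in the description could be sketched roughly as below. This is only an illustration under stated assumptions, not a patch against {{SolrContentHandler}}: the class {{AttributeWhitelistSketch}}, the method {{selectValues}} and the exact whitelist (Tika's {{URI_ATTRIBUTES}} plus the {{title}} and {{alt}} attributes named in the issue title) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical helper, not part of Solr or Tika. */
final class AttributeWhitelistSketch {

  // Assumed whitelist: Tika's URI_ATTRIBUTES plus "title" and "alt" from the issue title.
  static final Set<String> TEXT_BEARING_ATTRIBUTES = new HashSet<>(
      Arrays.asList("src", "href", "longdesc", "cite", "title", "alt"));

  /** Returns only the values of whitelisted attributes, in document order. */
  static List<String> selectValues(Attributes attributes) {
    List<String> kept = new ArrayList<>();
    for (int i = 0; i < attributes.getLength(); i++) {
      // Fall back to the qualified name when the parser is not namespace-aware.
      String name = attributes.getLocalName(i);
      if (name == null || name.isEmpty()) {
        name = attributes.getQName(i);
      }
      if (name != null && TEXT_BEARING_ATTRIBUTES.contains(name.toLowerCase(Locale.ROOT))) {
        kept.add(attributes.getValue(i));
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    AttributesImpl attrs = new AttributesImpl();
    attrs.addAttribute("", "href", "href", "CDATA", "http://example.com/");
    attrs.addAttribute("", "class", "class", "CDATA", "nav-link");
    attrs.addAttribute("", "title", "title", "CDATA", "Example site");
    // prints [http://example.com/, Example site]
    System.out.println(selectValues(attrs));
  }
}
```

In the {{else}} branch of the quoted {{startElement}}, the loop over all attributes would then append only the values returned by such a filter instead of every value, so {{class}} and {{style}} values never reach the catch-all field.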



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
