[jira] [Commented] (SOLR-9178) ExtractingRequestHandler doesn't strip HTML and adds metadata to content body

JIRA Tue, 20 Dec 2016 16:42:12 -0800

    [ https://issues.apache.org/jira/browse/SOLR-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765691#comment-15765691 ]


Ole Jørgen Brønner commented on SOLR-9178:
------------------------------------------

I think this is (unfortunately) expected behavior. If you look closer, the 
actual HTML tags have been removed, but the attribute values have been 
retained. 
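
To make the effect concrete, here is a minimal Python sketch. It is not Solr 
or Tika code: the class {{AttrValueDumper}}, the helper {{dump}} and the 
sample {{<link>}} tag (including its attribute order) are assumptions made up 
for illustration. It drops the tag markup but keeps the attribute values and 
text, which gives the same kind of output as the first line of 
{{attr_content}} in the issue.

```python
from html.parser import HTMLParser


class AttrValueDumper(HTMLParser):
    """Hypothetical illustration: drop tag markup, keep attribute values and text."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # The tag name itself is discarded; only non-empty attribute values survive.
        self.parts.extend(value for _name, value in attrs if value)

    def handle_data(self, data):
        # Keep element text, ignoring whitespace-only runs.
        text = data.strip()
        if text:
            self.parts.append(text)


def dump(html):
    """Return attribute values and text of `html`, joined by single spaces."""
    parser = AttrValueDumper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Assumed markup, reconstructed from the indexed output quoted in the issue.
sample = (
    '<link rel="stylesheet" type="text/css" charset="utf-8" '
    'media="all" href="/wiki/modernized/css/common.css">'
)
print(dump(sample))
# prints: stylesheet text/css utf-8 all /wiki/modernized/css/common.css
```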

You can keep the attribute values out of the content body by adding 
{{captureAttr=true}}, but if you also want to capture some elements in 
separate fields ({{capture=p}}), you're out of luck (i.e. you can't separate 
the captured attributes from the captured tag content).
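
As a rough sketch of the two variants, the Python snippet below only builds 
the request URLs; it does not contact Solr. The base URL and the default 
parameters are copied from the reproduction steps in the quoted issue, and 
the helper name {{extract_url}} is an assumption for illustration.

```python
from urllib.parse import urlencode

# Assumed local setup, taken from the reproduction steps in the quoted issue.
BASE = "http://localhost:8983/solr/mycore/update/extract"


def extract_url(**extra):
    """Build an extract request URL; `extra` adds or overrides parameters."""
    params = {
        "literal.id": "doc1",
        "uprefix": "attr_",
        "fmap.content": "attr_content",
        "commit": "true",
    }
    params.update(extra)
    return BASE + "?" + urlencode(params)


# Variant 1: attribute values are captured separately from the content body.
print(extract_url(captureAttr="true"))

# Variant 2: additionally capture <p> elements, the combination described
# above as not separating captured attributes from captured tag content.
print(extract_url(captureAttr="true", capture="p"))
```

Either URL can be dropped into the curl command from step 3 of the issue in 
place of the original one.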

Disclaimer: I'm no Solr expert, but I have recently spent a decent amount of 
time trying to bend Cell/Tika to my liking (unsuccessfully).

Related issue: SOLR-7027

> ExtractingRequestHandler doesn't strip HTML and adds metadata to content body
> -----------------------------------------------------------------------------
>
>                 Key: SOLR-9178
>                 URL: https://issues.apache.org/jira/browse/SOLR-9178
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>    Affects Versions: 5.0, 6.0.1
>         Environment: java version "1.8.0_91" 64 bit
> Linux Mint 17, 64 bit
>            Reporter: Simon Blandford
>
> Starting environment:
> solr-6.0.1.tgz is downloaded and extracted. We are in the solr-6.0.1 
> directory.
> The file, test.html, is downloaded from 
> https://wiki.apache.org/solr/UsingMailingLists.
> Affected versions: 4.10.3 is the last working version. 4.10.4 has some HTML 
> comments and JavaScript breaking through. Versions 5.0 and later show the 
> full symptoms described.
> Steps to reproduce:
> 1) bin/solr start
> 2) bin/solr create -c mycore
> 3) curl 
> "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true"
>  -F "content/tutorial=@test.html"
> 4) curl "http://localhost:8983/solr/mycore/select?q=information"
> Expected result: HTML->Text version of document indexed in <response> content 
> body.
> Actual result: The full HTML, with the angle brackets removed, is indexed 
> along with other unwanted metadata in the content body, including fragments 
> of CSS and JavaScript that were in the source document. 
> Head of response body below...
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader"><int name="status">0</int><int 
> name="QTime">0</int><lst name="params"><str 
> name="q">information</str></lst></lst><result name="response" numFound="1" 
> start="0"><doc><str name="id">doc1</str><arr 
> name="attr_stream_size"><str>20440</str></arr><arr 
> name="attr_x_parsed_by"><str>org.apache.tika.parser.DefaultParser</str><str>org.apache.tika.parser.html.HtmlParser</str></arr><arr
>  name="attr_stream_content_type"><str>text/html</str></arr><arr 
> name="attr_stream_name"><str>test.html</str></arr><arr 
> name="attr_stream_source_info"><str>content/tutorial</str></arr><arr 
> name="attr_dc_title"><str>UsingMailingLists - Solr Wiki</str></arr><arr 
> name="attr_content_encoding"><str>UTF-8</str></arr><arr 
> name="attr_robots"><str>index,nofollow</str></arr><arr 
> name="attr_title"><str>UsingMailingLists - Solr Wiki</str></arr><arr 
> name="attr_content_type"><str>text/html; charset=utf-8</str></arr><arr 
> name="attr_content"><str> 
>  
>  stylesheet text/css utf-8 all /wiki/modernized/css/common.css   stylesheet 
> text/css utf-8 screen /wiki/modernized/css/screen.css   stylesheet text/css 
> utf-8 print /wiki/modernized/css/print.css   stylesheet text/css utf-8 
> projection /wiki/modernized/css/projection.css   alternate Solr Wiki: 
> UsingMailingLists 
> /solr/UsingMailingLists?diffs=1&amp;show_att=1&amp;action=rss_rc&amp;unique=0&amp;page=UsingMailingLists&amp;ddiffs=1
>  application/rss+xml   Start /solr/FrontPage   Alternate Wiki Markup 
> /solr/UsingMailingLists?action=raw   Alternate print Print View 
> /solr/UsingMailingLists?action=print   Search /solr/FindPage   Index 
> /solr/TitleIndex   Glossary /solr/WordIndex   Help /solr/HelpOnFormatting   
> stream_size 20440  
>  X-Parsed-By org.apache.tika.parser.DefaultParser  
>  X-Parsed-By org.apache.tika.parser.html.HtmlParser  
>  stream_content_type text/html  
>  stream_name test.html  
>  stream_source_info content/tutorial  
>  dc:title UsingMailingLists - Solr Wiki  
>  Content-Encoding UTF-8  
>  robots index,nofollow  
>  Content-Type text/html; charset=utf-8  
>  UsingMailingLists - Solr Wiki 
>  
>  
>  header 
>  application/x-www-form-urlencoded get searchform /solr/UsingMailingLists 
>  
>  hidden action fullsearch  
>  hidden context 180  
>  searchinput Search: 
>  text searchinput value  20 searchFocus(this) searchBlur(this) 
> searchChange(this) searchChange(this) Search  
>  submit titlesearch titlesearch Titles Search Titles  
>  submit fullsearch fullsearch Text Search Full Text  
>  
>  
>  text/javascript 
> &lt;!--// Initialize search form
> var f = document.getElementById('searchform');
> f.getElementsByTagName('label')[0].style.display = 'none';
> var e = document.getElementById('searchinput');
> searchChange(e);
> searchBlur(e);
> //--&gt;
>  
>  logo  rect /solr/FrontPage Solr Wiki  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
