[jira] [Updated] (SOLR-9178) ExtractingRequestHandler doesn't strip HTML and adds metadata tags to content body

Simon Blandford (JIRA) Wed, 01 Jun 2016 07:21:14 -0700

     [ https://issues.apache.org/jira/browse/SOLR-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Simon Blandford updated SOLR-9178:
----------------------------------
    Description: 
Starting environment:
solr-6.0.1.tgz has been downloaded and extracted, and the working directory is solr-6.0.1.
The file test.html was downloaded from https://wiki.apache.org/solr/UsingMailingLists

Steps to reproduce:
1) bin/solr start
2) bin/solr create -c mycore

3) curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "content/tutorial=@test.html"

4) curl http://localhost:8983/solr/mycore/select?q=information
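
For readers who want to see the query parameters of step 3 one at a time, the request URL can be assembled programmatically. This is only an illustrative sketch in Python: the helper name build_extract_url is hypothetical and not part of Solr, and the host, core name, and parameter values are copied from the curl command above.

```python
from urllib.parse import urlencode

# Endpoint of the ExtractingRequestHandler for the core created in step 2.
BASE = "http://localhost:8983/solr/mycore/update/extract"

# Query parameters exactly as used in step 3.
params = {
    "literal.id": "doc1",            # literal value for the id field
    "uprefix": "attr_",              # prefix for fields not defined in the schema
    "fmap.content": "attr_content",  # map the extracted content field to attr_content
    "commit": "true",                # commit right after indexing
}


def build_extract_url(base: str, query: dict) -> str:
    """Hypothetical helper: join the endpoint and the encoded query string."""
    return base + "?" + urlencode(query)


url = build_extract_url(BASE, params)
print(url)
```

The printed URL should match the one passed to curl in step 3. The file itself is still sent as a multipart form upload (the -F option), which this sketch does not perform.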

Expected result: an HTML->Text version of the document is indexed in the content 
body of the <response>.

Actual result: the full HTML, with only the angle brackets removed, is indexed 
along with other unwanted metadata in the <response> body, including fragments 
of CSS and JavaScript that were in the source document. 
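
To make the expected result concrete, here is a minimal sketch of the kind of HTML->Text conversion meant above: element text is kept, while tag names, attribute values, and the contents of script and style elements are dropped. It uses Python's standard html.parser module purely for illustration and is not how Tika or Solr implement extraction. The class name TextOnly, the function html_to_text, and the sample markup are assumptions made up for this example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect element text, ignoring markup and script/style bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a script or style element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Attribute values (href, type, rel, ...) are deliberately discarded.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample loosely modelled on the head of the wiki page.
sample = (
    '<html><head><title>UsingMailingLists - Solr Wiki</title>'
    '<link rel="stylesheet" type="text/css" href="/wiki/modernized/css/common.css">'
    '<script type="text/javascript">var f = 1;</script></head>'
    '<body><p>Useful information</p></body></html>'
)
print(html_to_text(sample))
```

With this sample, the output contains the title and paragraph text only. The stylesheet attributes and the script body, which are the kind of material showing up in attr_content in the actual result, do not appear.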

Head of response body below...

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int><lst name="params"><str name="q">information</str></lst></lst>
<result name="response" numFound="1" start="0"><doc>
<str name="id">doc1</str>
<arr name="attr_stream_size"><str>20440</str></arr>
<arr name="attr_x_parsed_by"><str>org.apache.tika.parser.DefaultParser</str><str>org.apache.tika.parser.html.HtmlParser</str></arr>
<arr name="attr_stream_content_type"><str>text/html</str></arr>
<arr name="attr_stream_name"><str>test.html</str></arr>
<arr name="attr_stream_source_info"><str>content/tutorial</str></arr>
<arr name="attr_dc_title"><str>UsingMailingLists - Solr Wiki</str></arr>
<arr name="attr_content_encoding"><str>UTF-8</str></arr>
<arr name="attr_robots"><str>index,nofollow</str></arr>
<arr name="attr_title"><str>UsingMailingLists - Solr Wiki</str></arr>
<arr name="attr_content_type"><str>text/html; charset=utf-8</str></arr>
<arr name="attr_content"><str> 
 
 stylesheet text/css utf-8 all /wiki/modernized/css/common.css   stylesheet 
text/css utf-8 screen /wiki/modernized/css/screen.css   stylesheet text/css 
utf-8 print /wiki/modernized/css/print.css   stylesheet text/css utf-8 
projection /wiki/modernized/css/projection.css   alternate Solr Wiki: 
UsingMailingLists 
/solr/UsingMailingLists?diffs=1&amp;show_att=1&amp;action=rss_rc&amp;unique=0&amp;page=UsingMailingLists&amp;ddiffs=1
 application/rss+xml   Start /solr/FrontPage   Alternate Wiki Markup 
/solr/UsingMailingLists?action=raw   Alternate print Print View 
/solr/UsingMailingLists?action=print   Search /solr/FindPage   Index 
/solr/TitleIndex   Glossary /solr/WordIndex   Help /solr/HelpOnFormatting   
stream_size 20440  
 X-Parsed-By org.apache.tika.parser.DefaultParser  
 X-Parsed-By org.apache.tika.parser.html.HtmlParser  
 stream_content_type text/html  
 stream_name test.html  
 stream_source_info content/tutorial  
 dc:title UsingMailingLists - Solr Wiki  
 Content-Encoding UTF-8  
 robots index,nofollow  
 Content-Type text/html; charset=utf-8  
 UsingMailingLists - Solr Wiki 
 
 

 header 

 application/x-www-form-urlencoded get searchform /solr/UsingMailingLists 
 
 hidden action fullsearch  
 hidden context 180  
 searchinput Search: 
 text searchinput value  20 searchFocus(this) searchBlur(this) 
searchChange(this) searchChange(this) Search  
 submit titlesearch titlesearch Titles Search Titles  
 submit fullsearch fullsearch Text Search Full Text  
 

 

 text/javascript 
&lt;!--// Initialize search form
var f = document.getElementById('searchform');
f.getElementsByTagName('label')[0].style.display = 'none';
var e = document.getElementById('searchinput');
searchChange(e);
searchBlur(e);
//--&gt;
 

 logo  rect /solr/FrontPage Solr Wiki  
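
One way to check a response like the one above for the reported problem is to parse the XML and look for metadata names or CSS/JavaScript markers inside attr_content. The following is a rough sketch under stated assumptions: SAMPLE is a heavily shortened, hand-closed stand-in for the truncated response above, and both the marker list and the function name leaked_markers are hypothetical choices for this illustration, not part of Solr.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the response head shown above.
SAMPLE = """<response>
<result name="response" numFound="1" start="0"><doc>
<str name="id">doc1</str>
<arr name="attr_stream_name"><str>test.html</str></arr>
<arr name="attr_content"><str> stylesheet text/css utf-8 all /wiki/modernized/css/common.css
stream_size 20440 X-Parsed-By org.apache.tika.parser.html.HtmlParser
text/javascript var f = document.getElementById('searchform'); </str></arr>
</doc></result>
</response>"""

# Strings that should not appear in a plain-text content body (assumed list).
MARKERS = ("stream_size", "X-Parsed-By", "text/css", "text/javascript")


def leaked_markers(xml_text: str, field: str = "attr_content") -> list:
    """Return the markers found in the given multi-valued field."""
    root = ET.fromstring(xml_text)
    found = []
    for arr in root.iter("arr"):
        if arr.get("name") != field:
            continue
        body = " ".join(s.text or "" for s in arr.findall("str"))
        for marker in MARKERS:
            if marker in body and marker not in found:
                found.append(marker)
    return found


print(leaked_markers(SAMPLE))
```

An empty list would be consistent with the expected result. For the sample here, every marker is reported, which matches the behaviour described in this issue.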


  was:
Identical to the description above, except that the expected result read:

Expected result: HTML->Text version of document indexed in <response> body.


        Summary: ExtractingRequestHandler doesn't strip HTML and adds metadata 
tags to content body  (was: ExtractingRequestHandler doesn't strip HTML and 
adds metadata tags to indexed body)

> ExtractingRequestHandler doesn't strip HTML and adds metadata tags to content 
> body
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-9178
>                 URL: https://issues.apache.org/jira/browse/SOLR-9178
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>    Affects Versions: 6.0.1
>         Environment: java version "1.8.0_91" 64 bit
> Linux Mint 17, 64 bit
>            Reporter: Simon Blandford


