[
https://issues.apache.org/jira/browse/SOLR-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Simon Blandford updated SOLR-9178:
----------------------------------
Affects Version/s: 5.0
Description:
Starting environment:
solr-6.0.1.tgz is downloaded and extracted. We are in the solr-6.0.1 directory.
The file, test.html, is downloaded from
https://wiki.apache.org/solr/UsingMailingLists.
Affected versions: 4.10.3 is the last working version. In 4.10.4, some HTML
comments and JavaScript break through. Versions >5.0 show the full symptoms
described below.
Steps to reproduce:
1) bin/solr start
2) bin/solr create -c mycore
3) curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "content/tutorial=@test.html"
4) curl "http://localhost:8983/solr/mycore/select?q=information"
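The query parameters in step 3 can be spelled out in a short Python sketch.
The comments give my reading of the Solr Cell (ExtractingRequestHandler)
parameters; the names SOLR_BASE and extract_url are illustrative and do not
come from the report.

```python
from urllib.parse import urlencode

# Host and port as used in the report; adjust for another installation.
SOLR_BASE = "http://localhost:8983/solr"


def extract_url(core: str, doc_id: str) -> str:
    """Build the /update/extract URL from step 3 (hypothetical helper)."""
    params = [
        ("literal.id", doc_id),            # set the "id" field to a literal value
        ("uprefix", "attr_"),              # prefix for fields not defined in the schema
        ("fmap.content", "attr_content"),  # map the extracted "content" field to attr_content
        ("commit", "true"),                # commit as part of this request
    ]
    return f"{SOLR_BASE}/{core}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    print(extract_url("mycore", "doc1"))
```

The file itself is still sent as a multipart form upload, as the -F option of
curl does in step 3; this sketch only covers the URL.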
Expected result: a plain-text (HTML->Text) version of the document is indexed
and returned in the content body of the <response>.
Actual result: the full HTML, with the angle brackets removed, is indexed. The
content body also contains other unwanted metadata, including fragments of the
CSS and JavaScript that were in the source document.
Head of the response body:
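To make the expected result concrete, here is a minimal sketch of an
HTML->Text conversion that keeps visible text and drops script bodies, style
bodies and comments. It uses only the Python standard library and is an
illustration of the expected behaviour, not a description of how Tika or Solr
implement extraction. The names TextOnly and html_to_text are hypothetical.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text and ignore the bodies of <script> and <style>.

    HTML comments are dropped because handle_comment is not overridden.
    """

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, separated by single spaces."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    sample = (
        "<html><head><title>UsingMailingLists - Solr Wiki</title>"
        "<script type='text/javascript'><!--// Initialize search form\n"
        "var f = document.getElementById('searchform');\n//--></script></head>"
        "<body><p>Useful information.</p></body></html>"
    )
    print(html_to_text(sample))
```

With output of this kind, the indexed content would contain the page title and
body text but none of the search-form script seen in the actual result.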
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int><lst name="params"><str name="q">information</str></lst></lst>
<result name="response" numFound="1" start="0"><doc><str name="id">doc1</str>
<arr name="attr_stream_size"><str>20440</str></arr>
<arr name="attr_x_parsed_by"><str>org.apache.tika.parser.DefaultParser</str><str>org.apache.tika.parser.html.HtmlParser</str></arr>
<arr name="attr_stream_content_type"><str>text/html</str></arr>
<arr name="attr_stream_name"><str>test.html</str></arr>
<arr name="attr_stream_source_info"><str>content/tutorial</str></arr>
<arr name="attr_dc_title"><str>UsingMailingLists - Solr Wiki</str></arr>
<arr name="attr_content_encoding"><str>UTF-8</str></arr>
<arr name="attr_robots"><str>index,nofollow</str></arr>
<arr name="attr_title"><str>UsingMailingLists - Solr Wiki</str></arr>
<arr name="attr_content_type"><str>text/html; charset=utf-8</str></arr>
<arr name="attr_content"><str>
stylesheet text/css utf-8 all /wiki/modernized/css/common.css stylesheet
text/css utf-8 screen /wiki/modernized/css/screen.css stylesheet text/css
utf-8 print /wiki/modernized/css/print.css stylesheet text/css utf-8
projection /wiki/modernized/css/projection.css alternate Solr Wiki:
UsingMailingLists
/solr/UsingMailingLists?diffs=1&show_att=1&action=rss_rc&unique=0&page=UsingMailingLists&ddiffs=1
application/rss+xml Start /solr/FrontPage Alternate Wiki Markup
/solr/UsingMailingLists?action=raw Alternate print Print View
/solr/UsingMailingLists?action=print Search /solr/FindPage Index
/solr/TitleIndex Glossary /solr/WordIndex Help /solr/HelpOnFormatting
stream_size 20440
X-Parsed-By org.apache.tika.parser.DefaultParser
X-Parsed-By org.apache.tika.parser.html.HtmlParser
stream_content_type text/html
stream_name test.html
stream_source_info content/tutorial
dc:title UsingMailingLists - Solr Wiki
Content-Encoding UTF-8
robots index,nofollow
Content-Type text/html; charset=utf-8
UsingMailingLists - Solr Wiki
header
application/x-www-form-urlencoded get searchform /solr/UsingMailingLists
hidden action fullsearch
hidden context 180
searchinput Search:
text searchinput value 20 searchFocus(this) searchBlur(this)
searchChange(this) searchChange(this) Search
submit titlesearch titlesearch Titles Search Titles
submit fullsearch fullsearch Text Search Full Text
text/javascript
<!--// Initialize search form
var f = document.getElementById('searchform');
f.getElementsByTagName('label')[0].style.display = 'none';
var e = document.getElementById('searchinput');
searchChange(e);
searchBlur(e);
//-->
logo rect /solr/FrontPage Solr Wiki
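The symptoms in the response above could be checked for automatically. The
sketch below parses a complete Solr XML response and reports which telltale
strings appear in the attr_content field. The marker list is an assumption
drawn from the excerpt above, and the function name leaked_markers is
hypothetical.

```python
import xml.etree.ElementTree as ET

# Strings that should not appear in cleanly extracted text.
# Assumed from the response excerpt in this report; extend as needed.
LEAK_MARKERS = ("text/css", "text/javascript", "<!--", "getElementById", "X-Parsed-By")


def leaked_markers(response_xml, field="attr_content"):
    """Return the markers found in the named multi-valued field of a Solr XML response."""
    root = ET.fromstring(response_xml)
    found = []
    for arr in root.iter("arr"):
        if arr.get("name") != field:
            continue
        # Join all <str> values of the field into one string for searching.
        text = " ".join(s.text or "" for s in arr.findall("str"))
        for marker in LEAK_MARKERS:
            if marker in text and marker not in found:
                found.append(marker)
    return found
```

An empty list would suggest the content body is free of these particular
leaks; it does not prove that the extraction is correct.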
> ExtractingRequestHandler doesn't strip HTML and adds metadata to content body
> -----------------------------------------------------------------------------
>
> Key: SOLR-9178
> URL: https://issues.apache.org/jira/browse/SOLR-9178
> Project: Solr
> Issue Type: Bug
> Components: update
> Affects Versions: 5.0, 6.0.1
> Environment: java version "1.8.0_91" 64 bit
> Linux Mint 17, 64 bit
> Reporter: Simon Blandford