[
https://issues.apache.org/jira/browse/SOLR-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Simon Blandford updated SOLR-9178:
----------------------------------
Description:
Starting environment:
solr-6.0.1.tgz is downloaded and extracted. We are in the solr-6.0.1 directory.
The file, test.html, is downloaded from
https://wiki.apache.org/solr/UsingMailingLists.
Steps to reproduce:
1) bin/solr start
2) bin/solr create -c mycore
3) curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" \
   -F "content/tutorial=@test.html"
4) curl "http://localhost:8983/solr/mycore/select?q=information"
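
For anyone scripting the reproduction, the query string in step 3 can be assembled programmatically. The following is only a minimal Python sketch: build_extract_url is a hypothetical helper (not part of Solr), and the host, core name, and document id are the ones used in this report.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, core, doc_id):
    """Return the /update/extract URL from step 3 for the given core and id.

    Hypothetical helper for illustration; it only builds a string and
    does not contact Solr.
    """
    params = {
        "literal.id": doc_id,            # set the document's id field to a fixed value
        "uprefix": "attr_",              # prefix fields that are not defined in the schema
        "fmap.content": "attr_content",  # map the extracted "content" field to attr_content
        "commit": "true",                # commit once the document has been indexed
    }
    return f"{base_url}/solr/{core}/update/extract?{urlencode(params)}"


# Same host, core, and id as in the steps above.
url = build_extract_url("http://localhost:8983", "mycore", "doc1")
print(url)
```

The printed URL is the one passed to curl in step 3; the file itself is still uploaded separately as a multipart form field.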
Expected result: An HTML->text version of the document is indexed in the
<response> content body.
Actual result: The full HTML, with only the angle brackets removed, is indexed
along with other unwanted metadata in the <response> body, including fragments
of CSS and JavaScript that were in the source document.
The head of the response body follows:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">0</int><lst name="params"><str
name="q">information</str></lst></lst><result name="response" numFound="1"
start="0"><doc><str name="id">doc1</str><arr
name="attr_stream_size"><str>20440</str></arr><arr
name="attr_x_parsed_by"><str>org.apache.tika.parser.DefaultParser</str><str>org.apache.tika.parser.html.HtmlParser</str></arr><arr
name="attr_stream_content_type"><str>text/html</str></arr><arr
name="attr_stream_name"><str>test.html</str></arr><arr
name="attr_stream_source_info"><str>content/tutorial</str></arr><arr
name="attr_dc_title"><str>UsingMailingLists - Solr Wiki</str></arr><arr
name="attr_content_encoding"><str>UTF-8</str></arr><arr
name="attr_robots"><str>index,nofollow</str></arr><arr
name="attr_title"><str>UsingMailingLists - Solr Wiki</str></arr><arr
name="attr_content_type"><str>text/html; charset=utf-8</str></arr><arr
name="attr_content"><str>
stylesheet text/css utf-8 all /wiki/modernized/css/common.css stylesheet
text/css utf-8 screen /wiki/modernized/css/screen.css stylesheet text/css
utf-8 print /wiki/modernized/css/print.css stylesheet text/css utf-8
projection /wiki/modernized/css/projection.css alternate Solr Wiki:
UsingMailingLists
/solr/UsingMailingLists?diffs=1&show_att=1&action=rss_rc&unique=0&page=UsingMailingLists&ddiffs=1
application/rss+xml Start /solr/FrontPage Alternate Wiki Markup
/solr/UsingMailingLists?action=raw Alternate print Print View
/solr/UsingMailingLists?action=print Search /solr/FindPage Index
/solr/TitleIndex Glossary /solr/WordIndex Help /solr/HelpOnFormatting
stream_size 20440
X-Parsed-By org.apache.tika.parser.DefaultParser
X-Parsed-By org.apache.tika.parser.html.HtmlParser
stream_content_type text/html
stream_name test.html
stream_source_info content/tutorial
dc:title UsingMailingLists - Solr Wiki
Content-Encoding UTF-8
robots index,nofollow
Content-Type text/html; charset=utf-8
UsingMailingLists - Solr Wiki
header
application/x-www-form-urlencoded get searchform /solr/UsingMailingLists
hidden action fullsearch
hidden context 180
searchinput Search:
text searchinput value 20 searchFocus(this) searchBlur(this)
searchChange(this) searchChange(this) Search
submit titlesearch titlesearch Titles Search Titles
submit fullsearch fullsearch Text Search Full Text
text/javascript
<!--// Initialize search form
var f = document.getElementById('searchform');
f.getElementsByTagName('label')[0].style.display = 'none';
var e = document.getElementById('searchinput');
searchChange(e);
searchBlur(e);
//-->
logo rect /solr/FrontPage Solr Wiki
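
A check like the one below could be used to confirm the symptom automatically. It is a rough Python sketch, not a definitive test: the marker strings are simply examples copied from the leaked text quoted above, field_values and find_residue are hypothetical helper names, and the sample XML is a shortened, hand-closed stand-in for the real response.

```python
import xml.etree.ElementTree as ET

# Example markers copied from the leaked text in this report.
# They are assumptions for illustration, not an exhaustive list.
RESIDUE_MARKERS = (
    "text/css",
    "text/javascript",
    "document.getElementById",
    "X-Parsed-By",
)


def field_values(response_xml, field_name):
    """Return the string values of a multi-valued field from every <doc>."""
    root = ET.fromstring(response_xml)
    path = f".//doc/arr[@name='{field_name}']/str"
    return [node.text or "" for node in root.findall(path)]


def find_residue(text):
    """Return the markers that occur in text, in RESIDUE_MARKERS order."""
    return [marker for marker in RESIDUE_MARKERS if marker in text]


# Shortened stand-in for the response shown above (closing tags added by hand).
sample = """<response><result name="response" numFound="1" start="0"><doc>
<str name="id">doc1</str>
<arr name="attr_content"><str>stylesheet text/css utf-8 all /wiki/modernized/css/common.css
X-Parsed-By org.apache.tika.parser.html.HtmlParser
UsingMailingLists - Solr Wiki</str></arr>
</doc></result></response>"""

for value in field_values(sample, "attr_content"):
    print(find_residue(value))
```

An empty list for every value would indicate that none of the example markers leaked into attr_content; any non-empty list points at the behaviour described in this report.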
was:
Starting environment:
solr-6.0.1.tgz is downloaded and extracted. We are in the solr-6.0.1 directory.
The file, test.html, is downloaded from
https://wiki.apache.org/solr/UsingMailingLists.
Steps to reproduce:
1) bin/solr start
2) bin/solr create -c mycore
3) curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" \
   -F "content/tutorial=@test.html"
4) curl "http://localhost:8983/solr/mycore/select?q=information"
Expected result: An HTML->text version of the document is indexed in the
<response> body.
Actual result: The full HTML, with only the angle brackets removed, is indexed
along with other unwanted metadata in the <response> body, including fragments
of CSS and JavaScript that were in the source document.
The head of the response body follows:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">0</int><lst name="params"><str
name="q">information</str></lst></lst><result name="response" numFound="1"
start="0"><doc><str name="id">doc1</str><arr
name="attr_stream_size"><str>20440</str></arr><arr
name="attr_x_parsed_by"><str>org.apache.tika.parser.DefaultParser</str><str>org.apache.tika.parser.html.HtmlParser</str></arr><arr
name="attr_stream_content_type"><str>text/html</str></arr><arr
name="attr_stream_name"><str>test.html</str></arr><arr
name="attr_stream_source_info"><str>content/tutorial</str></arr><arr
name="attr_dc_title"><str>UsingMailingLists - Solr Wiki</str></arr><arr
name="attr_content_encoding"><str>UTF-8</str></arr><arr
name="attr_robots"><str>index,nofollow</str></arr><arr
name="attr_title"><str>UsingMailingLists - Solr Wiki</str></arr><arr
name="attr_content_type"><str>text/html; charset=utf-8</str></arr><arr
name="attr_content"><str>
stylesheet text/css utf-8 all /wiki/modernized/css/common.css stylesheet
text/css utf-8 screen /wiki/modernized/css/screen.css stylesheet text/css
utf-8 print /wiki/modernized/css/print.css stylesheet text/css utf-8
projection /wiki/modernized/css/projection.css alternate Solr Wiki:
UsingMailingLists
/solr/UsingMailingLists?diffs=1&show_att=1&action=rss_rc&unique=0&page=UsingMailingLists&ddiffs=1
application/rss+xml Start /solr/FrontPage Alternate Wiki Markup
/solr/UsingMailingLists?action=raw Alternate print Print View
/solr/UsingMailingLists?action=print Search /solr/FindPage Index
/solr/TitleIndex Glossary /solr/WordIndex Help /solr/HelpOnFormatting
stream_size 20440
X-Parsed-By org.apache.tika.parser.DefaultParser
X-Parsed-By org.apache.tika.parser.html.HtmlParser
stream_content_type text/html
stream_name test.html
stream_source_info content/tutorial
dc:title UsingMailingLists - Solr Wiki
Content-Encoding UTF-8
robots index,nofollow
Content-Type text/html; charset=utf-8
UsingMailingLists - Solr Wiki
header
application/x-www-form-urlencoded get searchform /solr/UsingMailingLists
hidden action fullsearch
hidden context 180
searchinput Search:
text searchinput value 20 searchFocus(this) searchBlur(this)
searchChange(this) searchChange(this) Search
submit titlesearch titlesearch Titles Search Titles
submit fullsearch fullsearch Text Search Full Text
text/javascript
<!--// Initialize search form
var f = document.getElementById('searchform');
f.getElementsByTagName('label')[0].style.display = 'none';
var e = document.getElementById('searchinput');
searchChange(e);
searchBlur(e);
//-->
logo rect /solr/FrontPage Solr Wiki
Summary: ExtractingRequestHandler doesn't strip HTML and adds metadata
tags to content body (was: ExtractingRequestHandler doesn't strip HTML and
adds metadata tags to indexed body)
> ExtractingRequestHandler doesn't strip HTML and adds metadata tags to content
> body
> ----------------------------------------------------------------------------------
>
> Key: SOLR-9178
> URL: https://issues.apache.org/jira/browse/SOLR-9178
> Project: Solr
> Issue Type: Bug
> Components: update
> Affects Versions: 6.0.1
> Environment: java version "1.8.0_91" 64 bit
> Linux Mint 17, 64 bit
> Reporter: Simon Blandford