Markus Schuch created SOLR-4908:
-----------------------------------

             Summary: SolrContentHandler produces glued words when extracting html
                 Key: SOLR-4908
                 URL: https://issues.apache.org/jira/browse/SOLR-4908
             Project: Solr
          Issue Type: Bug
          Components: contrib - Solr Cell (Tika extraction)
    Affects Versions: 4.3
         Environment: Windows 7, 64bit, Solr 4.3 example
            Reporter: Markus Schuch
         Attachments: tika-test.html

The SolrContentHandler produces glued words when extracting HTML documents such as:

{code}
<html><head></head><body>glued<br/>words</body></html>
{code}

This was solved in Tika [TIKA-343], but the problem still occurs when using the extraction handler because the SolrContentHandler discards ignorable whitespace. The Tika XHTMLContentHandler issues ignorableWhitespace events with a newline in the character stream when a <br> tag is encountered, so discarding those events removes the only separator between "glued" and "words".

The SolrContentHandler should be modified to add the ignorable whitespace to the content.

Reproduction Steps:
# POST the HTML example file from the attachments to http://localhost:8983/solr/update/extract?literal.id=html-test-1&commit=true (e.g. with curl or the HTTP Requester plugin in Firefox)
# Query for the document: http://localhost:8983/solr/collection1/select?q=id%3A%22html-test-1%22&fl=content&wt=xml&indent=true
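
The proposed change could look roughly like the sketch below. It is not the actual SolrContentHandler code: WhitespaceKeepingHandler is a hypothetical stand-in built only on the JDK's SAX DefaultHandler, and the events fired in main are an assumption that imitates what the report says Tika's XHTMLContentHandler emits for a <br> tag. The point it illustrates is that forwarding ignorableWhitespace to the same buffer as characters keeps the newline between the two words.

```java
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical stand-in for SolrContentHandler (not the real Solr class).
 * It collects text content and also keeps ignorable whitespace.
 */
class WhitespaceKeepingHandler extends DefaultHandler {
    private final StringBuilder content = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append regular character data to the extracted content.
        content.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Treat ignorable whitespace like ordinary text instead of dropping it.
        characters(ch, start, length);
    }

    public String getContent() {
        return content.toString();
    }

    public static void main(String[] args) {
        WhitespaceKeepingHandler handler = new WhitespaceKeepingHandler();
        char[] glued = "glued".toCharArray();
        char[] newline = {'\n'};
        char[] words = "words".toCharArray();

        // Assumed event sequence for: <body>glued<br/>words</body>
        handler.characters(glued, 0, glued.length);
        handler.ignorableWhitespace(newline, 0, newline.length);
        handler.characters(words, 0, words.length);

        // Show the newline as a visible escape so the separator is easy to see.
        System.out.println(handler.getContent().replace("\n", "\\n"));
    }
}
```

If the ignorableWhitespace override is removed, DefaultHandler's no-op implementation applies and the same event sequence yields "gluedwords", which matches the behavior described in this report.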