[ https://issues.apache.org/jira/browse/TIKA-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ken Krugler resolved TIKA-2683.
-------------------------------
Resolution: Fixed
Fixed via [PR #243|https://github.com/apache/tika/commit/8851d511c4768a3200eafa06237b99ec263a201a] from Karanjeet Singh.
> Missing space and inappropriate new-line in Boilerpipe extracted text
> ---------------------------------------------------------------------
>
> Key: TIKA-2683
> URL: https://issues.apache.org/jira/browse/TIKA-2683
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.18
> Environment: Replicable everywhere in all environments
> Reporter: Karanjeet Singh
> Assignee: Ken Krugler
> Priority: Major
> Labels: Boilerplate_Removal, boilerpipe, parser
> Fix For: 1.19
>
>
> The Boilerpipe extractor in Tika fails to capture space and new-line characters
> in HTML.
> It also inserts additional new-line characters in the middle of the text.
> *Example URL* - [https://en.wikipedia.org/wiki/Blobfish]
> The space in "family Psychrolutidae" is missing, and additional new-line
> characters appear around round brackets '('.
>
> A related issue was reported long ago:
> https://issues.apache.org/jira/browse/TIKA-961