[ https://issues.apache.org/jira/browse/TIKA-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ken Krugler reassigned TIKA-2683:
---------------------------------
Assignee: Ken Krugler
> Missing space and inappropriate new-line in Boilerpipe extracted text
> ---------------------------------------------------------------------
>
> Key: TIKA-2683
> URL: https://issues.apache.org/jira/browse/TIKA-2683
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.18
> Environment: Reproducible in all environments
> Reporter: Karanjeet Singh
> Assignee: Ken Krugler
> Priority: Major
> Labels: Boilerplate_Removal, boilerpipe, parser
> Fix For: 1.19
>
>
> The Boilerpipe extractor in Tika fails to preserve some space and new-line
> characters from the HTML.
> It also inserts additional new-line characters into the extracted text.
> *Example URL* - [https://en.wikipedia.org/wiki/Blobfish]
> In the extracted text, the space in "family Psychrolutidae" is missing and
> additional new-line characters appear around round brackets '('.
>
> A related issue was reported earlier:
> https://issues.apache.org/jira/browse/TIKA-961
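
The two symptoms described above (a dropped space and stray new-lines next to '(') could be checked mechanically once the extracted text is available as a string. The following is only a minimal sketch using the Java standard library. ExtractionCheck and findProblems are hypothetical names, not part of Tika or Boilerpipe, and the sample strings are made up for illustration rather than copied from real extractor output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Tika or Boilerpipe.
final class ExtractionCheck {

    // Heuristic: a line break directly before or directly after '('.
    // \R is the Java 8+ matcher for any line-break sequence.
    private static final Pattern NEWLINE_AROUND_OPEN_BRACKET =
            Pattern.compile("\\R[ \\t]*\\(|\\([ \\t]*\\R");

    // Returns a description of each symptom found in the extracted text.
    static List<String> findProblems(String extracted, String expectedPhrase) {
        List<String> problems = new ArrayList<>();

        if (!extracted.contains(expectedPhrase)) {
            // Same phrase with its spaces removed, e.g. "familyPsychrolutidae".
            String fused = expectedPhrase.replace(" ", "");
            if (extracted.contains(fused)) {
                problems.add("missing space in \"" + expectedPhrase + "\"");
            } else {
                problems.add("phrase not found: \"" + expectedPhrase + "\"");
            }
        }

        if (NEWLINE_AROUND_OPEN_BRACKET.matcher(extracted).find()) {
            problems.add("new-line next to '('");
        }
        return problems;
    }

    public static void main(String[] args) {
        // Assumed sample input that mimics the reported symptoms.
        String sample = "a fish of the familyPsychrolutidae\n(example)";
        for (String problem : findProblems(sample, "family Psychrolutidae")) {
            System.out.println(problem);
        }
    }
}
```

The bracket check is a heuristic: a paragraph that legitimately starts with '(' would also be flagged, so it is best applied to a passage that is known to sit on a single line in the source HTML.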
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)