Sai Konuri created TIKA-3814:
--------------------------------
Summary: Extracted text from HTML file does not exclude newline
chars from body
Key: TIKA-3814
URL: https://issues.apache.org/jira/browse/TIKA-3814
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 2.3.0
Reporter: Sai Konuri
Attachments: bug.html, image-2022-07-06-19-08-30-437.png,
image-2022-07-06-19-09-54-534.png
When there is a newline character ('\n') within the text of a
<span>, <p>, <text>, etc., the extracted text retains those newlines
instead of excluding them.
A sample HTML file (bug.html) is attached.
{*}Expected{*}:
!image-2022-07-06-19-08-30-437.png!
{*}Actual{*}:
!image-2022-07-06-19-09-54-534.png!
This is the code I am using to extract the text of the HTML file:
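{code:java}
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
try (InputStream stream =
        this.getClass().getClassLoader().getResourceAsStream("bug.html")) {
    parser.parse(stream, handler, metadata);
    // handler.toString() returns the extracted body text
    System.out.println(handler);
}
{code}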
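As a possible stopgap on the caller's side, the extracted string could be post-processed so that runs of whitespace collapse to a single space, roughly the way a browser renders inline text. This is only a minimal sketch and not Tika behavior: the class name WhitespaceCollapser and the method collapse are hypothetical, and the sketch uses only java.util.regex.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: collapses whitespace in extracted text.
final class WhitespaceCollapser {

    // One or more whitespace characters (spaces, tabs, '\n', '\r', ...).
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    private WhitespaceCollapser() {
    }

    // Replaces every whitespace run with a single space and trims both ends.
    static String collapse(String text) {
        if (text == null) {
            return "";
        }
        return WHITESPACE_RUN.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Stand-in for handler.toString() from the snippet above.
        String extracted = "First part of the text\nsecond part of the text";
        System.out.println(collapse(extracted));
    }
}
```

Note that this flattens every line break, including the ones emitted between block elements such as separate <p> tags, so it only fits cases where paragraph structure does not matter.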
--
This message was sent by Atlassian Jira
(v8.20.10#820010)