[
https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667821#comment-16667821
]
ASF GitHub Bot commented on TIKA-2599:
--------------------------------------
dameikle closed pull request #254: TIKA-2599: Fixed closing of styles around
Hyperlinks. Contributed by Ronan O'Sullivan.
URL: https://github.com/apache/tika/pull/254
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
index 30bd4bb969..6f7d3785bd 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
@@ -528,8 +528,8 @@ private int handleSpecialCharacterRuns(Paragraph p, int index, boolean skipStyli
                 url = text.substring(start, end);
             }
-            xhtml.startElement("a", "href", url);
             closeStyleElements(skipStyling, xhtml);
+            xhtml.startElement("a", "href", url);
             for (CharacterRun cr : texts) {
                 handleCharacterRun(cr, skipStyling, xhtml);
             }
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
index 7456ac409e..d2c38a42d5 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
@@ -560,6 +560,15 @@ public void testBoldHyperlink() throws Exception {
         assertContains("<a href=\"http://tika.apache.org/\"><b><u>hyper</u></b><u> link</u></a>; bold" , xml);
     }
+    @Test
+    public void testHyperlinkSurroundedByItalics() throws Exception {
+        //TIKA-2599
+        String xml = getXML("testWORD_italicsSurroundingHyperlink.doc").xml;
+        xml = xml.replaceAll("\\s+", " ");
+        assertContains("<body><p><i>Italic Test before link </i><a href=\"http://www.google.com\"><b><i>" +
+                "<u>hyperlink italics</u></i></b></a><i> Italic text after hyperlink</i></p>", xml);
+    }
+
     @Test
     public void testMacros() throws Exception {
diff --git a/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc b/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc
new file mode 100644
index 0000000000..24edb8f718
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc differ
----------------------------------------------------------------
> Hyperlink surrounded by Italics not closed Properly
> ---------------------------------------------------
>
> Key: TIKA-2599
> URL: https://issues.apache.org/jira/browse/TIKA-2599
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.14, 1.15, 1.16, 1.17
> Environment: Any
> Reporter: Ronan O'Sullivan
> Assignee: Dave Meikle
> Priority: Minor
> Fix For: 1.20
>
> Attachments: diff-TIKA-2599.txt,
> testWord_italicsSurroundingHyperlink.doc
>
>
> If a Word document contains a hyperlink surrounded by italicized text, the
> resulting xhtml is:
>
> <p><i>Italic Test before link <a href="http://www.google.com"/><b><i><u>hyperlink italics</u></i></b></a><i> Italic text after hyperlink</i></p>
>
> The opening italics tag is never closed, so the output is not valid XHTML.
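
The ordering problem described above can be illustrated with a small standalone sketch. This is not Tika code: the class StyleOrderSketch and its methods events and isWellNested are hypothetical names made up for this example, and the event list is a simplified stand-in for the SAX calls that the extractor makes. It only shows why an end tag for <i> emitted after <a> has been opened breaks nesting, while closing the style first does not.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical illustration only; none of these names exist in Tika. */
final class StyleOrderSketch {

    /** Emits a simplified tag sequence for "italic text followed by a hyperlink". */
    static List<String> events(boolean closeStylesBeforeLink) {
        List<String> out = new ArrayList<>();
        out.add("<i>");          // italic run before the link is still open
        if (closeStylesBeforeLink) {
            out.add("</i>");     // patched order: close open styles first
            out.add("<a>");
        } else {
            out.add("<a>");      // original order: anchor is opened first
            out.add("</i>");     // ends <i> while <a> is the innermost element
        }
        out.add("</a>");
        return out;
    }

    /** Returns true if every end tag matches the most recently opened element. */
    static boolean isWellNested(List<String> events) {
        Deque<String> open = new ArrayDeque<>();
        for (String e : events) {
            if (e.startsWith("</")) {
                String name = e.substring(2, e.length() - 1);
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false;
                }
            } else {
                open.push(e.substring(1, e.length() - 1));
            }
        }
        return open.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println("anchor first: " + isWellNested(events(false)));
        System.out.println("styles first: " + isWellNested(events(true)));
    }
}
```

With the anchor opened first, the check fails because </i> arrives while <a> is on top of the stack; with the styles closed first, both elements open and close in order.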