[ https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667793#comment-16667793 ]
ASF GitHub Bot commented on TIKA-2599:
--------------------------------------

dameikle closed pull request #253: TIKA-2599: Fixed closing of styles around Hyperlinks (by Ronan O'Sullivan)
URL: https://github.com/apache/tika/pull/253

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/CHANGES.txt b/CHANGES.txt
index 1f793d2f62..187531acf1 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -3,6 +3,9 @@ Release 1.20 - ???
    * Use -javaHome or $JAVA_HOME (if they exist) when spawning child in
      tika-server's -spawnChild mode.
 
+   * Fixed closing of styles around Hyperlinks in Word Parser
+     Contributed by Ronan O'Sullivan (TIKA-2599).
+
 Release 1.19.1 - 10/4/2018
 
    * Update PDFBox to 2.0.12, jempbox to 1.8.16
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
index 30bd4bb969..6f7d3785bd 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
@@ -528,8 +528,8 @@ private int handleSpecialCharacterRuns(Paragraph p, int index, boolean skipStyli
                 url = text.substring(start, end);
             }
 
-            xhtml.startElement("a", "href", url);
             closeStyleElements(skipStyling, xhtml);
+            xhtml.startElement("a", "href", url);
             for (CharacterRun cr : texts) {
                 handleCharacterRun(cr, skipStyling, xhtml);
             }
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
index 31bd8ba293..d7d6daee56 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
@@ -570,6 +570,15 @@ public void testBoldHyperlink() throws Exception {
         assertContains("<a href=\"http://tika.apache.org/\"><b><u>hyper</u></b><u> link</u></a>; bold" , xml);
     }
 
+    @Test
+    public void testHyperlinkSurroundedByItalics() throws Exception {
+        //TIKA-2599
+        String xml = getXML("testWORD_italicsSurroundingHyperlink.doc").xml;
+        xml = xml.replaceAll("\\s+", " ");
+        assertContains("<body><p><i>Italic Test before link </i><a href=\"http://www.google.com\"><b><i>" +
+                "<u>hyperlink italics</u></i></b></a><i> Italic text after hyperlink</i></p>", xml);
+    }
+
     @Test
     public void testMacros() throws Exception {
 
diff --git a/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc b/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc
new file mode 100644
index 0000000000..24edb8f718
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc differ

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Hyperlink surrounded by Italics not closed Properly
> ---------------------------------------------------
>
>                 Key: TIKA-2599
>                 URL: https://issues.apache.org/jira/browse/TIKA-2599
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>   Affects Versions: 1.14, 1.15, 1.16, 1.17
>         Environment: Any
>            Reporter: Ronan O'Sullivan
>            Priority: Minor
>         Attachments: diff-TIKA-2599.txt, testWord_italicsSurroundingHyperlink.doc
>
>
> If a Word document contains a hyperlink surrounded by italicized text, the resulting xhtml is:
>
> <p><i>Italic Test before link <a href="http://www.google.com"/><b><i><u>hyperlink italics</u></i></b></a><i> Italic text after hyperlink</i></p>
>
> The opening italics tag is not closed which is not valid XHTML.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
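
To illustrate why the order of the two calls in the WordExtractor.java hunk matters, here is a minimal, self-contained sketch. It does not use Tika's API: the class HyperlinkNestingSketch, its methods, and the "+tag"/"-tag" event encoding are all hypothetical, and the two event lists are a simplified assumption of what the extractor emits for the attached test document (the <b> and <u> runs are left out). It only checks whether each end tag closes the most recently opened element.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Tika.
public class HyperlinkNestingSketch {

    // Returns true if every "-tag" closes the most recently opened "+tag"
    // and nothing is left open at the end.
    public static boolean isWellNested(List<String> events) {
        Deque<String> open = new ArrayDeque<>();
        for (String event : events) {
            String tag = event.substring(1);
            if (event.charAt(0) == '+') {
                open.push(tag);
            } else if (open.isEmpty() || !open.pop().equals(tag)) {
                return false;
            }
        }
        return open.isEmpty();
    }

    // Assumed order before the patch: <a> is started first, then the
    // italics that were already open get closed inside the link.
    public static List<String> beforePatch() {
        return Arrays.asList("+p", "+i", "+a", "-i", "+i", "-i", "-a", "+i", "-i", "-p");
    }

    // Assumed order after the patch: open styles are closed first,
    // then <a> is started.
    public static List<String> afterPatch() {
        return Arrays.asList("+p", "+i", "-i", "+a", "+i", "-i", "-a", "+i", "-i", "-p");
    }

    public static void main(String[] args) {
        System.out.println("before patch well nested: " + isWellNested(beforePatch()));
        System.out.println("after patch well nested:  " + isWellNested(afterPatch()));
    }
}
```

Under these assumptions the first sequence is rejected because the end of <i> arrives while <a> is the innermost open element, and the second is accepted, which matches the markup asserted in testHyperlinkSurroundedByItalics.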