[
https://issues.apache.org/jira/browse/TIKA-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15989011#comment-15989011
]
ASF GitHub Bot commented on TIKA-2347:
--------------------------------------
stuarthendren opened a new pull request #173: Fix for TIKA-2347 Adds underline
extraction from word documents
URL: https://github.com/apache/tika/pull/173
Extracts underline for both doc and docx and assigns tag <u>.
The <u> tag is given the lowest nesting among the style tags.
Adds tests using testWORD_various.doc and testWord_various.docx
Updates affected output in other WordParserTests.
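As a rough illustration of the nesting order described above, here is a minimal
sketch. It assumes "lowest nesting" means that <u> is the innermost of the style
tags, with <b> outermost and <i> in between. The class StyleNestingSketch and its
wrap method are hypothetical names made up for this example. They are not part of
Tika or of this pull request, and the sketch does no XML escaping.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not Tika code: shows one possible nesting order
// in which <u> is opened last and therefore closed first (innermost).
final class StyleNestingSketch {

    // Wraps text in <b>, <i> and <u> (in that order) for each enabled style.
    static String wrap(String text, boolean bold, boolean italic, boolean underline) {
        Deque<String> open = new ArrayDeque<>(); // stack of currently open tags
        StringBuilder out = new StringBuilder();
        if (bold) {
            out.append("<b>");
            open.push("b");
        }
        if (italic) {
            out.append("<i>");
            open.push("i");
        }
        if (underline) {
            out.append("<u>"); // assumed innermost style tag
            open.push("u");
        }
        out.append(text);
        // Close in reverse order of opening so the output stays well-formed.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append(">");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrap("underline", false, false, true));
        System.out.println(wrap("all three", true, true, true));
    }
}
```

With all three styles enabled, the sketch puts the <u> element directly around
the text, inside <i> and <b>.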
> Underlined text is not decorated as such when extracting from word documents
> ----------------------------------------------------------------------------
>
> Key: TIKA-2347
> URL: https://issues.apache.org/jira/browse/TIKA-2347
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.0, 1.14
> Reporter: Stuart Hendren
>
> When extracting from doc and docx, bold and italic text decoration is
> extracted; however, underlining is not. This can be demonstrated in
> WordParserTest or OOXMLParserTest (change to docx) with the following test case.
> {code:title=WordParserTest.java|borderStyle=solid}
> @Test
> public void testTextDecoration() throws Exception {
>     XMLResult result = getXML("testWORD_various.doc");
>     String xml = result.xml;
>     assertTrue(xml.contains("<b>Bold</b>"));
>     assertTrue(xml.contains("<i>italic</i>"));
>     assertTrue(xml.contains("<u>underline</u>"));
> }
> {code}