[ https://issues.apache.org/jira/browse/TIKA-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929726#comment-15929726 ]
Sara Miller commented on TIKA-2177:
-----------------------------------
I understand. It is OK for us to leave it as it is; we can solve this in other
ways on our side.
Thank you for checking!
> microsoft.OfficeParser shows add links in additional paragraphs
> ---------------------------------------------------------------
>
> Key: TIKA-2177
> URL: https://issues.apache.org/jira/browse/TIKA-2177
> Project: Tika
> Issue Type: Bug
> Components: server
> Affects Versions: 1.13
> Environment: org.apache.tika.parser.microsoft.OfficeParser and
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser
> Reporter: Sara Miller
> Priority: Minor
>
> I'm converting Excel files, both .xls and .xlsx.
> .xls uses org.apache.tika.parser.microsoft.OfficeParser and
> .xlsx uses org.apache.tika.parser.microsoft.ooxml.OOXMLParser
> If I have a link in my Excel document, for example [email protected], the .xls
> parser adds additional elements to the document structure, so the output
> does not reflect how the document actually looks.
> For example, this table in file.xls:
> mailadress password
> [email protected] hohoho
> will output:
> <div class="page">
> <h1>Sheet1</h1>
> <table>
> <tbody>
> <tr>
> <td>mailadress</td>
> <td>password</td>
> </tr>
> <tr>
> <td>[email protected]</td>
> <td>hohoho</td>
> </tr>
> </tbody>
> </table>
> <div class="outside">
> <a href="mailto:[email protected]">[email protected]</a>
> </div>
> </div>
> The <div class="outside"> should be removed because it does not correspond to
> the document structure.
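For anyone who ends up with the same output, one possible client-side workaround of the kind mentioned in the comment above is to post-process the XHTML and drop every <div class="outside"> element. The following is only a rough sketch, assuming the parser output is well-formed XHTML without a DOCTYPE. OutsideDivStripper and stripOutsideDivs are hypothetical names, not part of Tika; only JDK XML classes are used.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not Tika API): removes <div class="outside"> elements
// from well-formed XHTML such as the output quoted in the issue.
final class OutsideDivStripper {

    static String stripOutsideDivs(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // XHTML usually declares a default namespace on the root element.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // The NodeList is live, so collect the matches before removing any.
            NodeList divs = doc.getElementsByTagName("div");
            List<Element> toRemove = new ArrayList<>();
            for (int i = 0; i < divs.getLength(); i++) {
                Element div = (Element) divs.item(i);
                if ("outside".equals(div.getAttribute("class"))) {
                    toRemove.add(div);
                }
            }
            for (Element div : toRemove) {
                if (div.getParentNode() != null) {
                    div.getParentNode().removeChild(div);
                }
            }

            // Serialize the modified DOM back to a string.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process XHTML", e);
        }
    }
}
```

Calling OutsideDivStripper.stripOutsideDivs(xhtml) on the output quoted above would keep the heading and the table and drop the trailing mailto link. Matching on the class attribute is a design shortcut: it removes every "outside" block, not only those produced by hyperlinks, so it may be too aggressive if other content in those blocks is needed.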
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)