[
https://issues.apache.org/jira/browse/TIKA-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924360#comment-15924360
]
Tim Allison commented on TIKA-2177:
-----------------------------------
Sorry for taking so long to reply.
I was just looking into this again over on POI. The issue is that hyperlink
addresses (or their reference ids) are stored after the sheet data in xls,
xlsx and xlsb, so they are not yet known when the cells are read. To emit a
link inline with its cell, we would have to parse the full sheet data, cache
the hyperlink addresses and then reparse the sheet data.
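Roughly, the parse/cache/reparse idea could look like the sketch below. This
is plain Java with no POI or Tika calls; the Rec type, the record layout and
all method names are hypothetical and only model "cells first, hyperlinks
afterwards". It is an illustration of the approach, not what either parser
does today.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of a sheet stream: cell records come first,
// hyperlink records come after them.
final class TwoPassHyperlinkSketch {

    /** One record in the (hypothetical) sheet stream. */
    static final class Rec {
        final boolean isHyperlink; // false = cell record, true = hyperlink record
        final String cellRef;      // e.g. "A2"
        final String value;        // cell display text, or the link url

        private Rec(boolean isHyperlink, String cellRef, String value) {
            this.isHyperlink = isHyperlink;
            this.cellRef = cellRef;
            this.value = value;
        }

        static Rec cell(String cellRef, String text) {
            return new Rec(false, cellRef, text);
        }

        static Rec hyperlink(String cellRef, String url) {
            return new Rec(true, cellRef, url);
        }
    }

    /** Pass 1: read the whole stream and cache cellRef -> url. */
    static Map<String, String> collectHyperlinks(List<Rec> stream) {
        Map<String, String> links = new HashMap<>();
        for (Rec r : stream) {
            if (r.isHyperlink) {
                links.put(r.cellRef, r.value);
            }
        }
        return links;
    }

    /**
     * Pass 2: read the stream again and emit each cell, wrapping it in a
     * link when pass 1 found a url for it. No HTML escaping is done here,
     * to keep the sketch short.
     */
    static List<String> renderCells(List<Rec> stream, Map<String, String> links) {
        List<String> out = new ArrayList<>();
        for (Rec r : stream) {
            if (r.isHyperlink) {
                continue; // already consumed in pass 1
            }
            String url = links.get(r.cellRef);
            if (url == null) {
                out.add("<td>" + r.value + "</td>");
            } else {
                out.add("<td><a href=\"" + url + "\">" + r.value + "</a></td>");
            }
        }
        return out;
    }

    /** Runs both passes over the same stream. */
    static List<String> render(List<Rec> stream) {
        return renderCells(stream, collectHyperlinks(stream));
    }
}
```

The cost is that every sheet is read twice, which is why this would need to
be configurable.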
A hyperlink can have 3 values: display, url and tooltip (at least in xlsx).
The display is (typically) stored in the sheet data in the appropriate cell.
The url and tooltip are stored outside of the sheet data. What you are seeing
in your example is what happens when display==url: the cell text and the link
emitted after the table are identical, so it looks like a duplicate.
So, if we made it configurable, would you be willing to double parse each sheet
in order to get the hyperlinks right?
> microsoft.OfficeParser shows add links in additional paragraphs
> ---------------------------------------------------------------
>
> Key: TIKA-2177
> URL: https://issues.apache.org/jira/browse/TIKA-2177
> Project: Tika
> Issue Type: Bug
> Components: server
> Affects Versions: 1.13
> Environment: org.apache.tika.parser.microsoft.OfficeParser and
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser
> Reporter: Sara Miller
> Priority: Minor
>
> I'm converting Excel files, both .xls and .xlsx.
> .xls uses org.apache.tika.parser.microsoft.OfficeParser and
> .xlsx uses org.apache.tika.parser.microsoft.ooxml.OOXMLParser
> If I have a link in my Excel document, for example [email protected], the .xls
> parser adds extra elements to the document structure, so the output does not
> match how the document looks.
> For example, this table in file.xls:
> mailadress password
> [email protected] hohoho
> will output:
> <div class="page">
> <h1>Sheet1</h1>
> <table>
> <tbody>
> <tr>
> <td>mailadress</td>
> <td>password</td>
> </tr>
> <tr>
> <td>[email protected]</td>
> <td>hohoho</td>
> </tr>
> </tbody>
> </table>
> <div class="outside">
> <a href="mailto:[email protected]">[email protected]</a>
> </div>
> </div>
> The <div class="outside"> should be removed because it does not correspond to
> the document structure.