Sara Miller created TIKA-2177:
---------------------------------
Summary: microsoft.OfficeParser shows add links in additional
paragraphs
Key: TIKA-2177
URL: https://issues.apache.org/jira/browse/TIKA-2177
Project: Tika
Issue Type: Bug
Components: server
Affects Versions: 1.13
Environment: org.apache.tika.parser.microsoft.OfficeParser and
org.apache.tika.parser.microsoft.ooxml.OOXMLParser
Reporter: Sara Miller
Priority: Minor
I'm converting Excel files, both .xls and .xlsx.
.xls uses org.apache.tika.parser.microsoft.OfficeParser and
.xlsx uses org.apache.tika.parser.microsoft.ooxml.OOXMLParser
If the Excel document contains a link, for example [email protected], the .xls
parser adds extra elements to the output that are not part of the document
structure, so the output misrepresents how the document looks.
For example, this table in file.xls:
mailadress        password
[email protected]   hohoho
will output:
<div class="page">
  <h1>Sheet1</h1>
  <table>
    <tbody>
      <tr>
        <td>mailadress</td>
        <td>password</td>
      </tr>
      <tr>
        <td>[email protected]</td>
        <td>hohoho</td>
      </tr>
    </tbody>
  </table>
  <div class="outside">
    <a href="mailto:[email protected]">[email protected]</a>
  </div>
</div>
The <div class="outside"> should be removed because it does not correspond to
the document structure.
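As a possible stopgap on the consumer side, the extra block could be stripped from the XHTML after parsing. The sketch below is not part of Tika and does not change parser behaviour. It assumes the output is well-formed XML and that the unwanted block is always a <div> whose class attribute is exactly "outside", as in the output above. The class name OutsideDivFilter and the method stripOutsideDivs are hypothetical, and only JDK XML APIs are used.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Tika class: post-processes XHTML produced by the parser.
class OutsideDivFilter {

    /** Removes every <div class="outside"> element, including its children. */
    static String stripOutsideDivs(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // getElementsByTagName returns a live list, so collect the
            // matches first and remove them in a second pass.
            NodeList divs = doc.getElementsByTagName("div");
            List<Element> toRemove = new ArrayList<>();
            for (int i = 0; i < divs.getLength(); i++) {
                Element div = (Element) divs.item(i);
                if ("outside".equals(div.getAttribute("class"))) {
                    toRemove.add(div);
                }
            }
            for (Element div : toRemove) {
                div.getParentNode().removeChild(div);
            }

            // Serialize the remaining tree without an XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Assumption: malformed input is treated as a caller error.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

Calling OutsideDivFilter.stripOutsideDivs(...) on the output shown above would leave the heading and table in place and drop the trailing link block. This only hides the symptom. The parser would still emit the element.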