[
https://issues.apache.org/jira/browse/TIKA-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282811#comment-15282811
]
Hudson commented on TIKA-1454:
------------------------------
UNSTABLE: Integrated in tika-2.x #91 (See
[https://builds.apache.org/job/tika-2.x/91/])
TIKA-1454: extract hyperlinks from ppt, pptx and xlsx (tallison: rev 229329d6ea58d5ef90aef7887bdf444463aed127)
* tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
* CHANGES.txt
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
* tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
* tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java
* tika-parser-modules/tika-parser-package-module/src/test/java/org/apache/tika/parser/pkg/ArParserTest.java
TIKA-1454: extract hyperlinks from ppt, pptx and xlsx -- undo ignoring (tallison: rev 6f5e7f94e6f4f01b4d2a7c453d025f0d1750817a)
* tika-parser-modules/tika-parser-package-module/src/test/java/org/apache/tika/parser/pkg/ArParserTest.java
> Extracting as HTML loses links in xlsx, ppt, and pptx files
> -----------------------------------------------------------
>
> Key: TIKA-1454
> URL: https://issues.apache.org/jira/browse/TIKA-1454
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12
> Environment: RedHat EL5, EL6, EL7
> Reporter: Chris Bryant
> Assignee: Tim Allison
> Fix For: 2.0, 1.14
>
> Attachments: testurl.ods, testurl.xlsx, urltest.odp, urltest.ppt,
> urltest.pptx
>
>
> I am trying to convert documents to HTML, then looking through the HTML for
> anchor tags to find links to external URLs. This works fine when looking at
> some document types, including PDFs, Open Document formats, Microsoft Word
> formats .doc and .docx, and the older Microsoft Excel .xls format, but it
> does not work for any Microsoft PowerPoint formats (.ppt or .pptx) and it
> does not work for the newer Excel .xlsx format. For the .ppt, .pptx, and
> .xlsx formats, the text is extracted properly and formatted into HTML, but
> the link is not converted to an anchor tag.
> I am running Tika in --server --html mode.
> I included samples of .xlsx, .ppt, and .pptx files that do not properly
> extract links, and also included samples of .ods and .odp files that do
> extract links properly.
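
The link-scanning step the reporter describes (search the extracted HTML for anchor tags that point to external URLs) can be sketched as below. This is a minimal illustration using only the Python standard library. The names `AnchorCollector` and `external_links` and the sample markup are hypothetical and are not taken from Tika or from the attached test files. The third paragraph of the sample mimics the reported symptom: the URL text is present, but it is not wrapped in an anchor, so a scan like this one cannot find it.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AnchorCollector(HTMLParser):
    """Collects href values from <a> tags in (X)HTML text."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def external_links(html_text):
    """Return hrefs with an http or https scheme, in document order."""
    collector = AnchorCollector()
    collector.feed(html_text)
    collector.close()
    return [u for u in collector.links if urlparse(u).scheme in ("http", "https")]


# Hypothetical sample markup, not actual Tika output.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p><a href="http://tika.apache.org/">Tika</a></p>'
    '<p><a href="#sheet2">internal</a></p>'
    '<p>plain text http://example.com/ not in an anchor</p>'
    '</body></html>'
)
print(external_links(sample))  # ['http://tika.apache.org/']
```

Only the first link is reported: the second is an in-document fragment, and the third URL appears as plain text rather than inside an anchor tag.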