[ https://issues.apache.org/jira/browse/TIKA-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282751#comment-15282751 ]

Hudson commented on TIKA-1454:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #989 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/989/])
TIKA-1454 -- added initial hyperlink extraction for ppt, pptx, xlsx.  
(tallison: rev 69852e4cb55d34e6513e0b66af7d75cb1b1408ba)
* tika-parsers/src/test/resources/test-documents/testEXCEL_hyperlinks.xlsx
* tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
* tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
* tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java
* tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
* tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
* tika-parsers/src/test/resources/test-documents/testEXCEL_hyperlinks.xls


> Extracting as HTML loses links in xlsx, ppt, and pptx files
> -----------------------------------------------------------
>
>                 Key: TIKA-1454
>                 URL: https://issues.apache.org/jira/browse/TIKA-1454
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12
>         Environment: RedHat EL5, EL6, EL7
>            Reporter: Chris Bryant
>            Assignee: Tim Allison
>         Attachments: testurl.ods, testurl.xlsx, urltest.odp, urltest.ppt, urltest.pptx
>
>
> I am trying to convert documents to HTML and then search the HTML for 
> anchor tags to find links to external URLs.  This works for some document 
> types, including PDFs, Open Document formats, Microsoft Word formats (.doc 
> and .docx), and the older Microsoft Excel .xls format, but it does not work 
> for any Microsoft PowerPoint format (.ppt or .pptx) or for the newer Excel 
> .xlsx format.  For the .ppt, .pptx, and .xlsx formats, the text is 
> extracted properly and formatted into HTML, but the link is not converted 
> to an anchor tag.
> I am running Tika in --server --html mode.
> I included samples of .xlsx, .ppt, and .pptx files that do not properly 
> extract links, and also included samples of .ods and .odp files that do 
> extract links properly.
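
For illustration, the anchor-tag scan described in the report could look roughly like the sketch below. It is not the reporter's actual code: the names AnchorCollector and external_links are hypothetical, the sample markup is invented rather than real Tika output, and "external" is assumed to mean an http or https URL.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag seen in the input."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def external_links(html_text):
    """Return hrefs whose scheme is http or https (assumed meaning of 'external')."""
    collector = AnchorCollector()
    collector.feed(html_text)
    collector.close()
    return [h for h in collector.hrefs if urlparse(h).scheme in ("http", "https")]


# Invented sample: one real anchor, one in-document anchor, and one URL that
# appears only as plain text (the situation reported for ppt, pptx, and xlsx).
sample = (
    '<html><body><p><a href="http://example.com/">link</a></p>'
    '<p><a href="#sheet1">internal</a></p>'
    '<p>http://example.org/ plain text, no anchor</p></body></html>'
)
print(external_links(sample))
```

A URL that reaches the HTML only as plain text is never returned by a scan like this, which is why the missing anchor tags matter to this workflow.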



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
