[jira] [Updated] (TIKA-1454) Extracting as HTML loses links in xlsx, ppt, and pptx files

Chris Bryant (JIRA) Wed, 22 Oct 2014 12:08:32 -0700

     [ https://issues.apache.org/jira/browse/TIKA-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Chris Bryant updated TIKA-1454:
-------------------------------
    Description: 
I am trying to convert documents to HTML, then looking through the HTML for 
anchor tags to find links to external URLs.  This works fine when looking at 
some document types, including PDFs, Open Document formats, Microsoft Word 
formats .doc and .docx, and the older Microsoft Excel .xls format, but it does 
not work for any Microsoft PowerPoint formats (.ppt or .pptx) and it does not 
work for the newer Excel .xlsx format.  For the .ppt, .pptx, and .xlsx formats, 
the text is extracted properly and formatted into HTML, but the link is not 
converted to an anchor tag.

I am running Tika in --server --html mode.

I included samples of .xlsx, .ppt, and .pptx files that do not properly extract 
links, and also included samples of .ods and .odp files that do extract links 
properly.

  was:
I am trying to convert documents to HTML, then looking through the HTML for 
anchor tags to find links to external URLs.  This works fine when looking at 
some document types, including PDFs, Open Document formats, Microsoft Word 
formats .doc and .docx, and the older Microsoft Excel .xls format, but it does 
not work for any Microsoft PowerPoint formats (.ppt or .pptx) and it does not 
work for the newer Excel .xlsx format.  For the .ppt, .pptx, and .xlsx formats, 
the text is extracted properly and formatted into HTML, but the link is not 
converted to an anchor tag.

I am running Tika in --server --html mode.


> Extracting as HTML loses links in xlsx, ppt, and pptx files
> -----------------------------------------------------------
>
>                 Key: TIKA-1454
>                 URL: https://issues.apache.org/jira/browse/TIKA-1454
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.6
>         Environment: I tested this only on RedHat EL5.
>            Reporter: Chris Bryant
>         Attachments: testurl.ods, testurl.xlsx, urltest.odp, urltest.ppt, 
> urltest.pptx
>
>
> I am trying to convert documents to HTML, then looking through the HTML for 
> anchor tags to find links to external URLs.  This works fine when looking at 
> some document types, including PDFs, Open Document formats, Microsoft Word 
> formats .doc and .docx, and the older Microsoft Excel .xls format, but it 
> does not work for any Microsoft PowerPoint formats (.ppt or .pptx) and it 
> does not work for the newer Excel .xlsx format.  For the .ppt, .pptx, and 
> .xlsx formats, the text is extracted properly and formatted into HTML, but 
> the link is not converted to an anchor tag.
>
> I am running Tika in --server --html mode.
>
> I included samples of .xlsx, .ppt, and .pptx files that do not properly 
> extract links, and also included samples of .ods and .odp files that do 
> extract links properly.
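
For anyone reproducing the report, the anchor-tag check described above can be sketched roughly as follows. This is only an illustration, assuming the HTML produced by Tika is already available as a string. The helper names (AnchorCollector, extract_links, external_links) and the sample markup are hypothetical and are not taken from Tika or from the attached files.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag seen in the input."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all href values found in anchor tags, in document order."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def external_links(html):
    """Keep only absolute http/https URLs."""
    return [u for u in extract_links(html)
            if u.startswith(("http://", "https://"))]


# Hypothetical samples: one where the link became an anchor tag,
# and one where the URL survived only as plain text.
with_anchor = '<p><a href="http://example.com/">Example</a></p>'
text_only = "<p>http://example.com/</p>"

print(external_links(with_anchor))
print(external_links(text_only))
```

With a check like this, the affected formats would show up as an empty list even though the URL text itself is present in the extracted HTML.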



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
