[jira] [Commented] (TIKA-1454) Extracting as HTML loses links in xlsx, ppt, and pptx files

Tim Allison (JIRA) Fri, 13 May 2016 08:00:32 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282760#comment-15282760
 ]


Tim Allison commented on TIKA-1454:
-----------------------------------

I added preliminary hyperlink extraction for xlsx, ppt, and pptx.

For ppt and pptx, it would be helpful if we could distinguish external (actual 
hyperlinks) from internal (references to a footnote) links.  That distinction 
will have to be made at the POI level.  For now, there's a bit of a hack to 
make the distinction and only href-ify external links.
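
The hack itself isn't described in this comment.  As a rough illustration 
only, one plausible heuristic is to treat a link target as external when it 
parses as a URI with a well-known scheme.  The class name, method name, and 
scheme list below are hypothetical and are not taken from Tika or POI.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: NOT the heuristic actually committed to Tika.
final class ExternalLinkSketch {

    // Assumption: these schemes are what we'd consider "actual hyperlinks".
    private static final Set<String> EXTERNAL_SCHEMES =
            Set.of("http", "https", "ftp", "mailto");

    /** Returns true when the target looks like a link to something outside the document. */
    static boolean looksExternal(String target) {
        if (target == null || target.trim().isEmpty()) {
            return false;
        }
        try {
            String scheme = new URI(target.trim()).getScheme();
            return scheme != null
                    && EXTERNAL_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // Not a parseable URI (e.g. contains spaces): treat as an internal reference.
            return false;
        }
    }
}
```

Only targets for which looksExternal(...) returns true would be wrapped in an 
anchor; everything else would be emitted as plain text.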

For xlsx, for now, we are dumping the hyperlinks at the bottom of each sheet.  
If we ran the sheet reader twice, we'd be able to cache the hyperlinks and put 
them in the cells in which they belong.  I'm not sure we want to add that 
double parsing unless there is demand.
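
To make the double-parsing idea concrete, here is a minimal sketch using 
plain Java collections rather than the real sheet reader.  The first pass 
caches hyperlinks by cell reference; the second pass emits the cells and 
wraps the text in an anchor when a link was cached.  All names here are 
hypothetical, and the input maps stand in for whatever the actual parser 
would deliver.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two-pass approach; not Tika or POI code.
final class TwoPassHyperlinkSketch {

    /** Pass 1: cache hyperlinks keyed by cell reference (e.g. "B1"). */
    static Map<String, String> collectLinks(List<String[]> hyperlinkRecords) {
        Map<String, String> byCell = new LinkedHashMap<>();
        for (String[] rec : hyperlinkRecords) {
            byCell.put(rec[0], rec[1]); // rec[0] = cell reference, rec[1] = URL
        }
        return byCell;
    }

    /** Pass 2: emit each cell, adding an anchor where a link was cached. */
    static String renderCells(Map<String, String> cells, Map<String, String> links) {
        StringBuilder html = new StringBuilder();
        for (Map.Entry<String, String> cell : cells.entrySet()) {
            String url = links.get(cell.getKey());
            html.append("<td>");
            if (url != null) {
                // Escaping of the URL and text is omitted to keep the sketch short.
                html.append("<a href=\"").append(url).append("\">")
                    .append(cell.getValue()).append("</a>");
            } else {
                html.append(cell.getValue());
            }
            html.append("</td>");
        }
        return html.toString();
    }
}
```

The cost is the one noted above: the sheet has to be read twice, once to 
build the cache and once to write the cells.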

For xls, I found no way to extract a hyperlink associated with a text box.  I 
have no doubt that there is a way; I just couldn't find it.

We could add more tests for ppt and pptx.

I would close this issue now, but we also have to add extraction for ods and 
odp.

> Extracting as HTML loses links in xlsx, ppt, and pptx files
> -----------------------------------------------------------
>
>                 Key: TIKA-1454
>                 URL: https://issues.apache.org/jira/browse/TIKA-1454
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12
>         Environment: RedHat EL5, EL6, EL7
>            Reporter: Chris Bryant
>            Assignee: Tim Allison
>             Fix For: 1.14
>
>         Attachments: testurl.ods, testurl.xlsx, urltest.odp, urltest.ppt, 
> urltest.pptx
>
>
> I am trying to convert documents to HTML, then looking through the HTML for 
> anchor tags to find links to external URLs.  This works fine when looking at 
> some document types, including PDFs, Open Document formats, Microsoft Word 
> formats .doc and .docx, and the older Microsoft Excel .xls format, but it 
> does not work for any Microsoft PowerPoint formats (.ppt or .pptx) and it 
> does not work for the newer Excel .xlsx format.  For the .ppt, .pptx, and 
> .xlsx formats, the text is extracted properly and formatted into HTML, but 
> the link is not converted to an anchor tag.
> I am running Tika in --server --html mode.
> I included samples of .xlsx, .ppt, and .pptx files that do not properly 
> extract links, and also included samples of .ods and .odp files that do 
> extract links properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
