Parse links in PDF
------------------

                 Key: TIKA-861
                 URL: https://issues.apache.org/jira/browse/TIKA-861
             Project: Tika
          Issue Type: New Feature
          Components: parser
    Affects Versions: 1.0
            Reporter: Sasha Goodman
            Priority: Minor
             Fix For: 1.1


Currently the XHTML output doesn't contain links, although PDFBox parses them. 
I'm new to Tika and haven't written Java for 6 years, but someone more 
experienced could probably do this in a few hours. 

The PDF2XHTML class already loops through each page's annotations. 

See line 136: 
{code:java}
for (Object o : page.getAnnotations()) {
{code}

I found some code for dealing with links in annotations:
http://stackoverflow.com/questions/7174709/pdfbox-not-recognizing-a-link

It involves checking the annotation's class: 
{code:java}
if (annotation instanceof PDAnnotationLink) {
    PDAnnotationLink link = (PDAnnotationLink) annotation;
    // handle the link here
}
{code}
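
To make the idea concrete, here is a rough, self-contained sketch of the loop plus the class check, ending with an XHTML anchor for each link. It is not Tika or PDFBox code: PageAnnotation, LinkAnnotation, getUri, anchorsFor and escape are hypothetical stand-ins I made up so that it compiles without PDFBox on the classpath. In the real parser the types would be PDFBox's PDAnnotation and PDAnnotationLink, and how the target URI is actually obtained from the link is left open here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LinkSketch {
    // Hypothetical stand-in for PDFBox's PDAnnotation.
    static class PageAnnotation {}

    // Hypothetical stand-in for PDFBox's PDAnnotationLink; getUri() is an
    // assumption of this sketch, not a PDFBox method.
    static class LinkAnnotation extends PageAnnotation {
        private final String uri;
        LinkAnnotation(String uri) { this.uri = uri; }
        String getUri() { return uri; }
    }

    // Loops over the annotations and builds one XHTML anchor per link
    // annotation; every other object in the list is skipped.
    static List<String> anchorsFor(List<?> annotations) {
        List<String> anchors = new ArrayList<>();
        for (Object o : annotations) {
            if (o instanceof LinkAnnotation) {
                LinkAnnotation link = (LinkAnnotation) o;
                if (link.getUri() != null) {
                    String uri = escape(link.getUri());
                    anchors.add("<a href=\"" + uri + "\">" + uri + "</a>");
                }
            }
        }
        return anchors;
    }

    // Minimal escaping so the URI is safe as an attribute value and as text.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("\"", "&quot;").replace("<", "&lt;");
    }

    public static void main(String[] args) {
        List<Object> annotations = Arrays.<Object>asList(
                new LinkAnnotation("http://tika.apache.org/"),
                new PageAnnotation(),
                "not an annotation");
        for (String anchor : anchorsFor(annotations)) {
            System.out.println(anchor);
        }
    }
}
```

In the real parser the anchor would presumably be written through the XHTML content handler rather than built as a string; the string form just keeps the sketch short.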

I hope this helps someone.


        
