Parse links in PDF
------------------

Key: TIKA-861
URL: https://issues.apache.org/jira/browse/TIKA-861
Project: Tika
Issue Type: New Feature
Components: parser
Affects Versions: 1.0
Reporter: Sasha Goodman
Priority: Minor
Fix For: 1.1

Currently the XHTML output doesn't contain links, even though PDFBox parses them. I'm new to Tika and haven't written Java in 6 years, but someone more experienced could probably do this in a few hours.

PDF2XHTML already loops through each page's annotations. See line 136:

{code:java}
for (Object o : page.getAnnotations()) {
    // existing annotation handling
}
{code}

I found some code for dealing with links in annotations: http://stackoverflow.com/questions/7174709/pdfbox-not-recognizing-a-link

It involves checking the annotation's class and casting it:

{code:java}
if (annotation instanceof PDAnnotationLink) {
    PDAnnotationLink link = (PDAnnotationLink) annotation;
    // the link annotation is available here
}
{code}

I hope this helps someone.
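
To make the idea concrete, here is a minimal sketch of the filter-and-emit step. It deliberately uses hypothetical stand-in types (`Annotation`, `LinkAnnotation`, `NoteAnnotation`) instead of the real PDFBox classes so that it compiles on its own; the class name `LinkSketch` and the helpers `escape` and `linksToXhtml` are likewise made up for illustration. In a real patch the `instanceof` check would be against `PDAnnotationLink`, and I assume the target URI would come from the link's action rather than from a plain field.

```java
import java.util.ArrayList;
import java.util.List;

final class LinkSketch {
    /** Hypothetical stand-in for a PDF annotation; not a PDFBox type. */
    public interface Annotation {}

    /** Hypothetical stand-in for a link annotation that carries a URI target. */
    public static final class LinkAnnotation implements Annotation {
        final String uri;
        public LinkAnnotation(String uri) { this.uri = uri; }
    }

    /** Some other annotation kind that the loop should skip. */
    public static final class NoteAnnotation implements Annotation {}

    /** Escape characters that are unsafe in XHTML text and attribute values. */
    public static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Walk the annotations, keep only links, and build one anchor element per link. */
    public static List<String> linksToXhtml(List<?> annotations) {
        List<String> out = new ArrayList<>();
        for (Object o : annotations) {
            if (o instanceof LinkAnnotation) {
                LinkAnnotation link = (LinkAnnotation) o;
                String href = escape(link.uri);
                out.add("<a href=\"" + href + "\">" + href + "</a>");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Annotation> page = new ArrayList<>();
        page.add(new NoteAnnotation());
        page.add(new LinkAnnotation("http://example.org/?a=1&b=2"));
        for (String anchor : linksToXhtml(page)) {
            System.out.println(anchor);
        }
    }
}
```

The sketch builds strings only to stay self-contained. Inside Tika the anchors would presumably be written through the parser's existing XHTML output path rather than concatenated by hand.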