[ https://issues.apache.org/jira/browse/TIKA-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348551#comment-15348551 ]
Tim Allison commented on TIKA-2018:
-----------------------------------

I'm not against implementing some basic heuristics based on font size to extract the title, as long as we keep the "extracted title" under a separate key from the actual "dc:title" (and, yes, I agree "when they are set" is a key limitation).

PDF-Inspector's license (GPL 2 or later) is not compatible with Apache. Can you submit a patch or recommend implementation details from, say, Beel et al.?

> Attempt to get Title from Full text if not present in MetaData ( Application/Pdf )
> ----------------------------------------------------------------------------------
>
>                 Key: TIKA-2018
>                 URL: https://issues.apache.org/jira/browse/TIKA-2018
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Florent Valdelievre
>            Priority: Minor
>
> A vast majority of PDF documents don't fill in their metadata.
> As a result, Tika won't be able to get information like the title.
> There is a [nice scientific|http://docear.org/papers/SciPlore%20Xtract%20--%20Extracting%20Titles%20from%20Scientific%20PDF%20Documents%20by%20Analyzing%20Style%20Information%20%28Font%20Size%29-preprint.pdf] document explaining how to get the title from the styles present in the document with a simple rule-based heuristic. We can probably ask for the source code if necessary.
> Also, I have tested another lib, https://github.com/Docear/PDF-Inspector, which does a great job. However, it seems to work exclusively with a File object, which is not suitable in a Hadoop and Nutch context. It would have been nice if it worked with a stream.
> What do you think?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
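
For illustration only, here is a rough sketch of what a font-size heuristic with a separate key could look like. It is not the method from Beel et al. and not existing Tika code: the `TextLine` type, the `guessTitle` and `addGuessedTitle` helpers, and the `extracted:title` key name are all hypothetical, and the sketch assumes the PDF parser can already hand over the first page's lines together with their font sizes. A plain `Map` stands in for the metadata object.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class TitleHeuristic {

    // Hypothetical key name, kept apart from "dc:title" so a guess is never
    // mistaken for metadata that the document itself declared.
    static final String EXTRACTED_TITLE_KEY = "extracted:title";

    /** One line of first-page text plus its font size (hypothetical input type). */
    static final class TextLine {
        final String text;
        final float fontSize;

        TextLine(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /**
     * Returns the first run of consecutive non-blank lines set in the largest
     * font size on the page, joined with spaces, or empty if there is no text.
     */
    static Optional<String> guessTitle(List<TextLine> firstPageLines) {
        // Pass 1: find the largest font size used by any non-blank line.
        float max = 0f;
        for (TextLine line : firstPageLines) {
            if (!line.text.trim().isEmpty()) {
                max = Math.max(max, line.fontSize);
            }
        }
        if (max <= 0f) {
            return Optional.empty();
        }

        // Pass 2: collect the first consecutive run of lines at that size.
        List<String> parts = new ArrayList<>();
        for (TextLine line : firstPageLines) {
            boolean isLargest = !line.text.trim().isEmpty()
                    && Math.abs(line.fontSize - max) < 0.5f; // small tolerance
            if (isLargest) {
                parts.add(line.text.trim());
            } else if (!parts.isEmpty()) {
                break; // the run has ended; ignore later text in the same size
            }
        }
        return Optional.of(String.join(" ", parts));
    }

    /** Stores the guess under the separate key and leaves "dc:title" untouched. */
    static void addGuessedTitle(Map<String, String> metadata, List<TextLine> firstPageLines) {
        guessTitle(firstPageLines).ifPresent(t -> metadata.put(EXTRACTED_TITLE_KEY, t));
    }

    public static void main(String[] args) {
        List<TextLine> lines = new ArrayList<>();
        lines.add(new TextLine("Preprint", 9f));
        lines.add(new TextLine("Extracting Titles from", 18f));
        lines.add(new TextLine("Scientific PDF Documents", 18f));
        lines.add(new TextLine("Some Author", 11f));

        Map<String, String> metadata = new LinkedHashMap<>();
        addGuessedTitle(metadata, lines);
        System.out.println(metadata);
    }
}
```

A real implementation would likely need more rules than this (for example, skipping very short lines or large-font journal banners), which is where implementation details from the paper would help.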