[ https://issues.apache.org/jira/browse/TIKA-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348551#comment-15348551 ]
Tim Allison commented on TIKA-2018:
-----------------------------------
I'm not against implementing some basic font-size heuristics to extract the
title, as long as we keep the "extracted title" under a separate key from the
actual "dc:title" (and yes, I agree that "when they are set" is a key
limitation).
PDF-Inspector's license (GPL 2 or later) is not compatible with Apache.
Can you submit a patch, or recommend implementation details from, say, Beel et
al.?
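To make the separate-key idea concrete, here is a minimal sketch in plain Java.
It is not Tika or PDFBox API: TextRun, guessTitle, and the key
"pdf:extractedTitle" are hypothetical names, the Map stands in for a metadata
object, and the rule (take the first contiguous block of text at the largest
font size on the first page) is only a rough simplification of the font-size
idea, not the algorithm from Beel et al.

```java
import java.util.List;
import java.util.Map;

final class TitleHeuristic {

    /** Hypothetical text run: a piece of text plus the font size it was drawn with. */
    static final class TextRun {
        final String text;
        final float fontSize;

        TextRun(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /** Hypothetical key, deliberately distinct from "dc:title". */
    static final String EXTRACTED_TITLE_KEY = "pdf:extractedTitle";

    /**
     * Returns the first contiguous block of non-blank runs drawn at the
     * largest font size, or null if there is no non-blank text.
     * Runs are assumed to be in reading order and limited to the first page.
     */
    static String guessTitle(List<TextRun> firstPageRuns) {
        float max = 0f;
        for (TextRun run : firstPageRuns) {
            if (!run.text.trim().isEmpty() && run.fontSize > max) {
                max = run.fontSize;
            }
        }
        if (max == 0f) {
            return null;
        }
        StringBuilder title = new StringBuilder();
        boolean started = false;
        for (TextRun run : firstPageRuns) {
            String text = run.text.trim();
            if (text.isEmpty()) {
                continue; // blank runs neither extend nor end the title
            }
            if (Math.abs(run.fontSize - max) < 0.01f) {
                if (title.length() > 0) {
                    title.append(' ');
                }
                title.append(text);
                started = true;
            } else if (started) {
                break; // the first largest-font block has ended
            }
        }
        return title.toString();
    }

    /** Stores the guess under its own key and never touches "dc:title". */
    static void addExtractedTitle(Map<String, String> metadata, List<TextRun> runs) {
        String title = guessTitle(runs);
        if (title != null && !title.isEmpty()) {
            metadata.put(EXTRACTED_TITLE_KEY, title);
        }
    }
}
```

Keeping the guess under its own key means a consumer can still tell a title
set by the author apart from one inferred by the heuristic, and can decide
which of the two to trust.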
> Attempt to get Title from Full text if not present in MetaData (
> Application/Pdf )
> ----------------------------------------------------------------------------------
>
> Key: TIKA-2018
> URL: https://issues.apache.org/jira/browse/TIKA-2018
> Project: Tika
> Issue Type: Improvement
> Reporter: Florent Valdelievre
> Priority: Minor
>
> The vast majority of PDF documents do not fill in their metadata.
> As a result, Tika cannot extract information such as the title.
> There is a [nice
> scientific|http://docear.org/papers/SciPlore%20Xtract%20--%20Extracting%20Titles%20from%20Scientific%20PDF%20Documents%20by%20Analyzing%20Style%20Information%20%28Font%20Size%29-preprint.pdf]
> document explaining how to get the title from the styles present in the
> document using a simple rule-based heuristic. We can probably ask for the
> source code if necessary.
> Also, I have tested another library, https://github.com/Docear/PDF-Inspector,
> which does a great job. However, it seems to work exclusively with File
> objects, which is not suitable in a Hadoop and Nutch context. It would have
> been nice if it worked with streams.
> What do you think?
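
On the File-versus-stream point above: a common workaround for a File-only
library is to spool the stream to a temporary file first. The sketch below
uses only the JDK; StreamSpooler, FileTask, and withTempFile are hypothetical
names, and the callback stands in for whatever File-only call is needed. It
says nothing about PDF-Inspector's actual API. Tika's own
TikaInputStream.getFile() should do similar spooling where Tika is already on
the classpath.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class StreamSpooler {

    /** Hypothetical callback for code that only accepts a File. */
    interface FileTask<T> {
        T apply(File file) throws IOException;
    }

    /**
     * Copies the stream to a temporary file, runs the task against that
     * file, and deletes the file afterwards, even if the task throws.
     */
    static <T> T withTempFile(InputStream in, FileTask<T> task) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".pdf");
        try {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return task.apply(tmp.toFile());
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The cost is one extra copy to local disk per document, which may or may not
be acceptable on a Hadoop worker.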