[ 
https://issues.apache.org/jira/browse/TIKA-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348286#comment-15348286
 ] 

Tim Allison commented on TIKA-2018:
-----------------------------------

bq. A vast majority of PDF documents don't fill in meta information. 

To confirm: you do _not_ mean that Tika is failing to pull metadata out of the 
metadata component or the XMP within PDFs, correct?

If you're trying to extract "metadata" from the content via NLP or other 
heuristics, have you experimented with the [Grobid Journal 
Parser|https://wiki.apache.org/tika/GrobidJournalParser]?
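
For illustration, here is a minimal sketch of the kind of content-based heuristic under discussion: treat the text set in the largest font on the first page as the title. The {{TextRun}} and {{TitleHeuristic}} names are hypothetical and are not part of Tika or Grobid; the sketch assumes some PDF library has already supplied the text runs with their font sizes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical container: one run of text with a uniform font size.
final class TextRun {
    final String text;
    final float fontSize;
    final int page; // 1-based page number

    TextRun(String text, float fontSize, int page) {
        this.text = text;
        this.fontSize = fontSize;
        this.page = page;
    }
}

// Hypothetical helper, not a Tika API.
final class TitleHeuristic {
    // Returns the first contiguous block of first-page text set in the
    // largest font found on that page, or empty if page 1 has no text.
    static Optional<String> guessTitle(List<TextRun> runs) {
        float max = 0f;
        for (TextRun r : runs) {
            if (r.page == 1 && !r.text.trim().isEmpty() && r.fontSize > max) {
                max = r.fontSize;
            }
        }
        if (max == 0f) {
            return Optional.empty();
        }
        List<String> parts = new ArrayList<>();
        boolean started = false;
        for (TextRun r : runs) {
            if (r.page != 1 || r.text.trim().isEmpty()) {
                continue;
            }
            if (Math.abs(r.fontSize - max) < 0.01f) {
                parts.add(r.text.trim());
                started = true;
            } else if (started) {
                break; // the title block has ended
            }
        }
        return Optional.of(String.join(" ", parts));
    }
}
```

A rule this simple will pick up large-font journal banners or cover-page logos, so it is only a starting point.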

> Attempt to get Title from Full text if not present in MetaData ( 
> Application/Pdf )
> ----------------------------------------------------------------------------------
>
>                 Key: TIKA-2018
>                 URL: https://issues.apache.org/jira/browse/TIKA-2018
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Florent Valdelievre
>            Priority: Minor
>
> A vast majority of PDF documents don't fill in meta information. 
> As a result, Tika often cannot return information such as the title.
> There is a [nice scientific 
> paper|http://docear.org/papers/SciPlore%20Xtract%20--%20Extracting%20Titles%20from%20Scientific%20PDF%20Documents%20by%20Analyzing%20Style%20Information%20%28Font%20Size%29-preprint.pdf]
>  explaining how to get the title from the styles present in the document 
> using simple rule-based heuristics. We can probably ask for the source code 
> if necessary.
> Also, I have tested another library, https://github.com/Docear/PDF-Inspector, 
> which does a great job. However, it seems to work exclusively with File 
> objects, which does not fit a Hadoop and Nutch context. It would be nice 
> if it also worked with streams.
> What do you think?
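
On the File-versus-stream point raised in the description, one common workaround is to spool the stream to a temporary file, hand that file to the File-only library, and delete it afterwards. The sketch below uses only the JDK; {{StreamSpooler}} and {{withTempFile}} are hypothetical names, and the {{fileOnlyApi}} callback stands in for whatever the library actually exposes, which is not shown in this thread. Tika's own {{TikaInputStream}} provides comparable spooling to a temporary file.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Function;

// Hypothetical helper, not part of Tika or PDF-Inspector.
final class StreamSpooler {
    // Copies the stream to a temp file, applies the File-only callback,
    // and always removes the temp file afterwards.
    static <T> T withTempFile(InputStream in, String suffix,
                              Function<File, T> fileOnlyApi) throws IOException {
        Path tmp = Files.createTempFile("spool-", suffix);
        try {
            // createTempFile already created the file, so overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return fileOnlyApi.apply(tmp.toFile());
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

This costs one extra local write per document, which may or may not be acceptable on a Hadoop worker.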



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
