[jira] [Commented] (TIKA-2018) Attempt to get Title from Full text if not present in MetaData ( Application/Pdf )

Florent Valdelievre (JIRA) Fri, 24 Jun 2016 09:22:34 -0700

    [ https://issues.apache.org/jira/browse/TIKA-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348506#comment-15348506 ]


Florent Valdelievre commented on TIKA-2018:
-------------------------------------------

Tika does a good job of extracting metadata when it is set.
However, I wanted a fallback that uses more complex algorithms to get the 
title (like https://github.com/Docear/PDF-Inspector).

The Grobid Journal Parser seems to be made exclusively for journal publications.
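
To make the idea of a fallback concrete, here is a minimal sketch: use the
metadata title when it is set, otherwise guess the title from the largest-font
text on the first page. This is a deliberately simplified font-size heuristic,
not the PDF-Inspector algorithm and not anything Tika ships. The names
TitleFallback, StyledLine, resolveTitle and guessTitle are hypothetical, and
the sketch assumes the first-page lines and their font sizes have already been
extracted from the PDF.

```java
import java.util.ArrayList;
import java.util.List;

public class TitleFallback {

    /** One line of first-page text with its font size (hypothetical input type). */
    public static final class StyledLine {
        final String text;
        final float fontSize;

        public StyledLine(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /** Returns the metadata title if it is set, otherwise a guess from the styled lines. */
    public static String resolveTitle(String metadataTitle, List<StyledLine> firstPage) {
        if (metadataTitle != null && !metadataTitle.trim().isEmpty()) {
            return metadataTitle.trim();
        }
        return guessTitle(firstPage);
    }

    /** Joins the first run of consecutive non-blank lines set in the largest font size. */
    public static String guessTitle(List<StyledLine> lines) {
        // Find the largest font size used by any non-blank line.
        float max = 0f;
        for (StyledLine l : lines) {
            if (!l.text.trim().isEmpty()) {
                max = Math.max(max, l.fontSize);
            }
        }
        // Collect the first consecutive run of lines in that size.
        List<String> parts = new ArrayList<>();
        for (StyledLine l : lines) {
            boolean isLargest = !l.text.trim().isEmpty()
                    && Math.abs(l.fontSize - max) < 0.01f;
            if (isLargest) {
                parts.add(l.text.trim());
            } else if (!parts.isEmpty()) {
                break; // the run of largest-font lines has ended
            }
        }
        return String.join(" ", parts);
    }
}
```

The font sizes would have to come from the PDF layer (for example from the
text positions PDFBox reports while stripping text), which this sketch leaves
out. A real heuristic would also need extra rules, for instance to skip
oversized journal banners or page numbers.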

> Attempt to get Title from Full text if not present in MetaData ( 
> Application/Pdf )
> ----------------------------------------------------------------------------------
>
>                 Key: TIKA-2018
>                 URL: https://issues.apache.org/jira/browse/TIKA-2018
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Florent Valdelievre
>            Priority: Minor
>
> The vast majority of PDF documents don't fill in their metadata. 
> As a result, Tika can't extract information such as the title.
> There is a [nice 
> scientific|http://docear.org/papers/SciPlore%20Xtract%20--%20Extracting%20Titles%20from%20Scientific%20PDF%20Documents%20by%20Analyzing%20Style%20Information%20%28Font%20Size%29-preprint.pdf]
>  document explaining how to get the title from the styles present in the document 
> using simple rule-based heuristics. We can probably request the source code 
> if necessary.
> Also, I have tested another library, https://github.com/Docear/PDF-Inspector, which 
> does a great job. However, it seems to work exclusively with File objects, 
> which does not fit a Hadoop and Nutch context. It would have been nice 
> if it worked with streams.
> What do you think?
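
On the File-versus-stream point in the description, one possible workaround is
to spool the stream to a temporary file and hand that file to the File-only
library. The sketch below uses only java.nio and assumes local temporary
storage is available on the worker. StreamSpooler, FileFunction and
withTempFile are hypothetical names, and the File-only library call is
represented by a callback rather than by any real PDF-Inspector API.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StreamSpooler {

    /** A function over a File, standing in for a File-only library call. */
    public interface FileFunction<T> {
        T apply(File file) throws IOException;
    }

    /**
     * Copies the stream to a temporary file, applies fn to that file,
     * and deletes the file afterwards, even if fn throws.
     */
    public static <T> T withTempFile(InputStream in, FileFunction<T> fn) throws IOException {
        Path tmp = Files.createTempFile("pdf-title-", ".pdf");
        try {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return fn.apply(tmp.toFile());
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The cost is one extra copy of each document to local disk, which may or may
not be acceptable in a Hadoop or Nutch job.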



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
