Florent Valdelievre created TIKA-2018:
-----------------------------------------

             Summary: Attempt to get Title from Full text if not present in 
MetaData ( Application/Pdf )
                 Key: TIKA-2018
                 URL: https://issues.apache.org/jira/browse/TIKA-2018
             Project: Tika
          Issue Type: Improvement
            Reporter: Florent Valdelievre
            Priority: Minor


The vast majority of PDF documents do not populate their metadata fields.
As a result, Tika cannot extract information such as the title.

There is a [nice scientific paper|http://docear.org/papers/SciPlore%20Xtract%20--%20Extracting%20Titles%20from%20Scientific%20PDF%20Documents%20by%20Analyzing%20Style%20Information%20%28Font%20Size%29-preprint.pdf]
explaining how to extract the title from the style information in the
document (font size) using simple rule-based heuristics. The source code is
probably available on request if necessary.
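
To make the idea concrete, here is a minimal sketch of a font-size heuristic.
It is not the paper's algorithm and not an existing Tika API: the names
TitleGuesser, StyledLine and guessTitle are hypothetical, and the sketch
assumes that the text lines of the first page and their font sizes have
already been extracted by the PDF parser. It simply takes the first run of
consecutive lines that use the largest font size on the page.

```java
import java.util.List;

/** Hypothetical sketch: guess a title from first-page lines by font size. */
final class TitleGuesser {

    /** One line of text plus its font size in points (hypothetical container). */
    static final class StyledLine {
        final String text;
        final float fontSize;

        StyledLine(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /**
     * Returns the first run of consecutive lines that use the largest font
     * size on the page, joined by spaces, or null if no usable line exists.
     */
    static String guessTitle(List<StyledLine> firstPageLines) {
        // Pass 1: find the largest font size among non-blank lines.
        float maxSize = 0f;
        for (StyledLine line : firstPageLines) {
            if (!line.text.trim().isEmpty() && line.fontSize > maxSize) {
                maxSize = line.fontSize;
            }
        }
        if (maxSize == 0f) {
            return null;
        }

        // Pass 2: collect the first consecutive run of lines at that size.
        StringBuilder title = new StringBuilder();
        boolean started = false;
        for (StyledLine line : firstPageLines) {
            String text = line.text.trim();
            boolean isLargest =
                    !text.isEmpty() && Math.abs(line.fontSize - maxSize) < 0.01f;
            if (isLargest) {
                if (started) {
                    title.append(' ');
                }
                title.append(text);
                started = true;
            } else if (started) {
                break; // the first run of largest-font lines has ended
            }
        }
        return title.toString();
    }
}
```

A real implementation would likely need extra rules (for example, ignoring
very short lines or text in page headers), so this should be read as a
starting point only.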

Also, I have tested another library, https://github.com/Docear/PDF-Inspector,
which does a great job. However, it seems to work exclusively with File
objects, which does not fit the Hadoop and Nutch context. It would be nice if
it worked with streams.
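
One possible workaround for a File-only library is to spool the stream to a
temporary file first. The sketch below uses only the standard java.nio.file
API; StreamSpooler and spoolToTempFile are hypothetical names, and whether the
extra disk I/O is acceptable in a Hadoop or Nutch job is an open question.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: bridge an InputStream to a File-based API. */
final class StreamSpooler {

    /**
     * Copies the stream to a new temporary file and returns its path.
     * The caller is responsible for deleting the file when done.
     */
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("pdf-title-", ".pdf");
        try {
            // createTempFile already created the file, so overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            Files.deleteIfExists(tmp); // do not leave a partial file behind
            throw e;
        }
    }
}
```

The resulting path can be handed to a File-based library via tmp.toFile() and
removed afterwards with Files.deleteIfExists(tmp).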

What do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
