Florent Valdelievre created TIKA-2018:
-----------------------------------------
Summary: Attempt to get Title from Full text if not present in
MetaData ( Application/Pdf )
Key: TIKA-2018
URL: https://issues.apache.org/jira/browse/TIKA-2018
Project: Tika
Issue Type: Improvement
Reporter: Florent Valdelievre
Priority: Minor
The vast majority of PDF documents do not fill in their metadata fields.
As a result, Tika cannot extract information such as the title.
There is a [nice
scientific|http://docear.org/papers/SciPlore%20Xtract%20--%20Extracting%20Titles%20from%20Scientific%20PDF%20Documents%20by%20Analyzing%20Style%20Information%20%28Font%20Size%29-preprint.pdf]
paper explaining how to get the title from the style information present in the
document (font size) using a simple rule-based heuristic. We can probably obtain
the source code on request if necessary.
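To make the idea concrete, here is a minimal sketch of a font-size heuristic. It is a simplification for discussion, not the paper's actual rule set, and it assumes the text lines and their font sizes have already been extracted from the PDF by some other component. The names StyledLine and TitleHeuristic are hypothetical and are not part of Tika or PDFBox.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical holder for one extracted line of text (not a Tika or PDFBox type).
final class StyledLine {
    final String text;
    final float fontSize;
    final int page; // 1-based page number

    StyledLine(String text, float fontSize, int page) {
        this.text = text;
        this.fontSize = fontSize;
        this.page = page;
    }
}

// Assumption: the title is the first run of consecutive lines on page 1
// that use the largest font size found on that page.
final class TitleHeuristic {
    private TitleHeuristic() {}

    static Optional<String> guessTitle(List<StyledLine> lines) {
        // Pass 1: find the largest font size among non-empty lines on page 1.
        float max = 0f;
        for (StyledLine l : lines) {
            if (l.page == 1 && !l.text.trim().isEmpty() && l.fontSize > max) {
                max = l.fontSize;
            }
        }
        if (max == 0f) {
            return Optional.empty(); // no usable text on the first page
        }

        // Pass 2: join the first run of lines set in that size (multi-line titles).
        StringBuilder title = new StringBuilder();
        boolean started = false;
        for (StyledLine l : lines) {
            if (l.page != 1) {
                continue;
            }
            boolean isLargest = !l.text.trim().isEmpty()
                    && Math.abs(l.fontSize - max) < 0.01f;
            if (isLargest) {
                if (title.length() > 0) {
                    title.append(' ');
                }
                title.append(l.text.trim());
                started = true;
            } else if (started) {
                break; // the run of largest-font lines has ended
            }
        }
        return Optional.of(title.toString());
    }
}
```

Such a guess would only be used as a fallback when the metadata title is missing, since large-font text on the first page is not always the title.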
Also, I have tested another library ( https://github.com/Docear/PDF-Inspector ) which
does a great job. However, it seems to work only with a File object, which
is not suitable in a Hadoop and Nutch context. It would be nice if it
worked with a stream.
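One possible workaround, sketched below under stated assumptions, is to spool the stream to a temporary file and hand that file to the File-only API. The class StreamToFile and the fileOnlyApi parameter are hypothetical names used for illustration; the sketch uses only the Java standard library and does not call PDF-Inspector itself.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Function;

// Hypothetical helper: adapts a File-only API to an InputStream source.
final class StreamToFile {
    private StreamToFile() {}

    static <T> T withTempFile(InputStream in, Function<File, T> fileOnlyApi)
            throws IOException {
        // Create a temporary file on local disk to hold the stream content.
        Path tmp = Files.createTempFile("pdf-title-", ".pdf");
        try {
            // Copy the whole stream; the caller remains responsible for closing it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return fileOnlyApi.apply(tmp.toFile());
        } finally {
            // Always remove the temporary file, even if the callback throws.
            Files.deleteIfExists(tmp);
        }
    }
}
```

The cost is one extra copy of each document to local disk, which may or may not be acceptable for large crawls.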
What do you think?