Re: Extracting paper/book title from a PDF

Thach Tran Mon, 02 Feb 2009 11:09:05 -0800

Daniele Development-ML wrote:

Hello everybody,
I'm using PDFBox to try to extract some specific text from a PDF file. In
particular, I'm trying to detect the book title, author, and the
bibliographic entries (the references) - the PDF file is printed through the
pdftex command.


Extracting the raw text doesn't help much, as no structural data is carried
with it. I was therefore trying to browse the document structure, access
the COS objects, and get the text values through them. This may work only
for the title and the authors, which might each be written in a separate
paragraph.

However, I'm getting a bit confused on the real feasibility of this approach
and on the use of the documentTreeStructure and the COSDictionary.

Has anybody ever faced/solved this problem?
Any comments or suggestions, or pointers to examples? The examples in the
distro seem not to cover this aspect fully, or perhaps I am wrong.

Many thanks,

Dan

Hi Dan!

I don't think you can extract the title, author, or any other "specific"
text from what the PDF actually displays, and it isn't supposed to work
that way either. The content of a page in a PDF simply does not record
whether a piece of text is a title, an author, and so on.

As I understand it, you want to take the text in the first paragraph as
the title and the text in the next paragraph as the author. That is not
very feasible either, because PDF has no notion of a paragraph. For a
title "My Title", the page content may just say something like: display
"My Title" at point x,y.

For PDFs generated by pdftex, the situation is even worse. To achieve
high-quality typesetting, TeX/LaTeX places text in a very complex way.
For example, you could find your title "My Title" specified as follows in
the PDF's content:

display "M" at position x1, y1
display "y" at position x2, y2
etc
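
To illustrate why per-glyph placement makes this hard, here is a minimal
Java sketch of how a reader has to guess words back from positions alone.
It is not PDFBox code: the names GlyphDemo, Glyph, joinLine and
gapThreshold are hypothetical, and it assumes all glyphs sit on one line
and that the distance between glyph start positions is enough to detect a
word break (real extractors also need glyph widths and font sizes).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class GlyphDemo {
    /** One positioned character, as a content stream might place it. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Rebuilds one line of text from glyphs in arbitrary order.
     * Glyphs are sorted by x; a space is inserted whenever the distance
     * between two consecutive start positions exceeds gapThreshold.
     * The y coordinate is ignored (single-line assumption).
     */
    static String joinLine(List<Glyph> glyphs, double gapThreshold) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));

        StringBuilder line = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : sorted) {
            if (previous != null && g.x - previous.x > gapThreshold) {
                line.append(' '); // a wide gap is only a guess at a word break
            }
            line.append(g.text);
            previous = g;
        }
        return line.toString();
    }
}
```

Nothing in the input says "this is a title": the word break is a guess
based on a threshold, and whether the resulting line is a title, an
author, or body text still has to be inferred by other heuristics.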

Your best hope is to get hold of the PDDocumentInformation object (by
calling getDocumentInformation() on a PDDocument object), which
represents the Info dictionary in the trailer of the PDF file. This can
contain the title and author of the PDF file, and it is also the
appropriate place to store such information in a PDF.

However, I doubt that such information is included in the PDF you are
working with. It is "meta information" that is not displayed when viewing
the file, so people often don't bother to put it in when making the file.
Certainly in the case of pdftex, one typically has to use the hyperref
package and explicitly specify the title and author with \hypersetup in
order to produce a PDF with that "meta information".

Sorry for the lengthy explanation, I was just trying to make it clear :-)

Cheers,
Thach
