Daniele Development-ML wrote:
Hello everybody,
I'm using PDFBox to try to extract some specific text from a PDF file. In
particular, I'm trying to detect the book title, author, and the
bibliographic entries (the references) - the PDF file is printed through the
pdftex command.
Extracting the raw text doesn't help much, as no metadata is carried with
it. I was therefore trying to browse the document structure, access
the COS objects, and get the text values through them. This may only
work for the title and the authors, which might each be written in a
separate paragraph.
However, I'm getting a bit confused on the real feasibility of this approach
and on the use of the documentTreeStructure and the COSDictionary.
Has anybody ever faced/solved this problem?
Any comments or suggestions, or pointers to examples? The examples in the
distro seem not to cover this aspect fully, or perhaps I am wrong.
Many thanks,
Dan
Hi Dan!
I don't think you can extract the title, author, or any "specific" text,
for that matter, from what the PDF actually displays, and it is not
supposed to work that way either. This is simply because the content of a
page in a PDF does not capture any information specifying whether a piece
of text is a title, an author, etc. As you said earlier, if I understand
correctly, you want to take the text in the first paragraph as the title
and the text in the next paragraph as the author. This is also not very
feasible since, again, PDF does not even have a notion of paragraphs.
For instance, for a title "My Title", the content of the page may
just say something like: display "My Title" at point x,y.
Moreover, for PDFs generated by pdftex, the situation is even worse. In
order to achieve high-quality typesetting, the way TeX/LaTeX typesets
text is very complex. For example, you could find that your title
"My Title" is specified as follows in the PDF's content:
display "M" at position x1, y1
display "y" at position x2, y2
etc
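
To make the problem concrete, here is a minimal sketch in plain Java of
what any extractor has to do with such output: group the positioned
glyphs back into lines and guess where the spaces were. The Glyph class,
the reassemble method, and the two thresholds are hypothetical names and
values chosen for illustration; they are not part of PDFBox, and a real
text stripper has to cope with far more (fonts, rotation, columns).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class GlyphLineSketch {

    /** Hypothetical stand-in for one "display c at position x, y" instruction. */
    static final class Glyph {
        final String text;
        final double x, y, width;

        Glyph(String text, double x, double y, double width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Rebuilds plain text from positioned glyphs.
     * lineTolerance: maximum baseline difference for glyphs on the same line.
     * spaceGap: horizontal gap above which a space is assumed.
     */
    static String reassemble(List<Glyph> glyphs, double lineTolerance, double spaceGap) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        // PDF y coordinates grow upward, so the highest y is the first line.
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y).reversed());

        // Group glyphs whose baselines are close enough into the same line.
        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : sorted) {
            List<Glyph> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - g.y) <= lineTolerance) {
                last.add(g);
            } else {
                lines.add(new ArrayList<>(Arrays.asList(g)));
            }
        }

        StringBuilder out = new StringBuilder();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            if (out.length() > 0) {
                out.append('\n');
            }
            Glyph prev = null;
            for (Glyph g : line) {
                // A gap wider than spaceGap is taken to be a word break.
                if (prev != null && g.x - (prev.x + prev.width) > spaceGap) {
                    out.append(' ');
                }
                out.append(g.text);
                prev = g;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Glyphs deliberately out of order, with slight kerning and baseline jitter.
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("T", 118, 700, 6), new Glyph("M", 100, 700.2, 9),
                new Glyph("e", 133, 700, 5), new Glyph("y", 108.7, 700, 5),
                new Glyph("i", 124, 700, 3), new Glyph("l", 130, 700, 3),
                new Glyph("t", 127, 700, 3));
        System.out.println(reassemble(glyphs, 1.0, 2.0));
    }
}
```

Note that even when the characters are recovered this way, nothing in the
result says that the line is a title; that still has to be guessed from
position or font size.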
Your best hope is to get hold of the PDDocumentInformation object (by
calling getDocumentInformation() on a PDDocument object), which
represents the Info dictionary in the trailer of the PDF file. This
could contain the title and author of the PDF file, and it is also the
appropriate place to store such information in a PDF.
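
A minimal sketch of that lookup, assuming the org.apache.pdfbox package
names (older releases used org.pdfbox instead) and an already loaded
PDDocument. How the document is loaded depends on the PDFBox version
(PDDocument.load(File) in 2.x, Loader.loadPDF(File) in 3.x), so that part
is left out. The class name, the method name, and the "(none)"
placeholder are made up for illustration.

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;

class InfoDictionarySketch {

    /** Returns "title / author", with "(none)" for entries missing from the Info dictionary. */
    static String titleAndAuthor(PDDocument doc) {
        PDDocumentInformation info = doc.getDocumentInformation();
        String title = info.getTitle();   // null when the Title entry is absent
        String author = info.getAuthor(); // null when the Author entry is absent
        return (title == null ? "(none)" : title)
                + " / "
                + (author == null ? "(none)" : author);
    }
}
```

Both getters can return null, so the caller has to be ready for a PDF
whose Info dictionary is empty or missing.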
However, I doubt that such information is included in the PDF you
are working with, since this sort of information is kinda "meta
information" and is not displayed when viewing the file, so people don't
really bother to put it in when making the file.
Certainly in the case of pdftex, one has to use the hyperref package and
explicitly specify the title and author with \hypersetup in order to
produce a PDF with that "meta information".
Sorry for the lengthy explanation, just trying to make it clear :-)
Cheers,
Thach