Re: [poppler] Extract title from pdf file.

Leonard Rosenthol Thu, 10 Nov 2011 14:15:59 -0800

I am sorry to be pedantic, but this is EXTREMELY IMPORTANT…

What you are doing is adding HEURISTICS into Poppler to GUESS at the logical 
structure of a PDF.  You are NOT actually taking into account any REAL LIVE 
logical structure that was put there by the PDF producer.


PDF 1.3 is about 15 YEARS OLD.  NUMEROUS ADVANCES have been made to the format.
PDF is currently at 1.7, as standardized by the ISO and adopted as national 
standards by almost 50 countries around the world.  Version 2.0 (ISO 32000-2) 
is almost complete!  To work only with 1.3 is, honestly, a waste.  You are 
missing HUGE PIECES of functionality found in the majority of real-world 
documents.

I am sure your code is wonderful.  However, given that it is based on 1.3 and 
does not recognize existing PDF structure, it seems SEVERELY limited in real 
world use.

Leonard

From: Alec Taylor <[email protected]>
Date: Thu, 10 Nov 2011 13:57:54 -0800
To: Leonard Rosenthol <[email protected]>
Cc: "[email protected]" <[email protected]>, Albert Cid <[email protected]>
Subject: Re: [poppler] Extract title from pdf file.


As was previously mentioned, I am adding the semantic and logical structuring 
into poppler core.

My plan is to figure out what fits into which category by post processing the 
XML. Any suggestions on how to reverse [or post?!] engineer this XML back into 
the PDF would be appreciated.

In a few days I will have very accurate XML generated, with 
<header></header>, <footer></footer> and table of contents tags.

This will involve "pushing" the actual "printed" page numbers, adding a 
hyperlink to each ToC entry, and partitioning the page structure as far as 
the 1.3 standard allows.
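
As a rough illustration of the post-processing step described above, here is a minimal Python sketch. It is not the code discussed in this thread, and the XML layout is an assumption: the element names <document>, <toc>, <entry> and <page>, and the attributes "title", "page" and "index", are hypothetical. Only <header> and <footer> come from the message. The sketch reads the printed page number out of each footer and uses it to resolve every ToC entry to a physical page index, which is the information a hyperlink would need.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical XML layout. Apart from <header> and <footer>, every element
# and attribute name here is an assumption made for this example.
SAMPLE = """\
<document>
  <toc>
    <entry title="Introduction" page="1"/>
    <entry title="Methods" page="3"/>
  </toc>
  <page index="0"><header>My Book</header><footer>i</footer></page>
  <page index="1"><header>My Book</header><footer>1</footer></page>
  <page index="2"><header>My Book</header><footer>2</footer></page>
  <page index="3"><header>My Book</header><footer>Page 3</footer></page>
</document>
"""


def printed_page_map(root):
    """Map each printed page label found in a footer to its physical page index."""
    mapping = {}
    for page in root.iter("page"):
        footer = page.findtext("footer", default="")
        match = re.search(r"\d+", footer)  # footers without digits are skipped
        if match:
            # Keep the first page that carries a given printed number.
            mapping.setdefault(match.group(), int(page.get("index")))
    return mapping


def link_toc(root):
    """Return (title, physical page index or None) for every ToC entry."""
    labels = printed_page_map(root)
    return [(entry.get("title"), labels.get(entry.get("page")))
            for entry in root.iter("entry")]


if __name__ == "__main__":
    print(link_toc(ET.fromstring(SAMPLE)))
```

Note that this is itself a heuristic over extracted text: it guesses page labels from footer strings and does not read any page-label or structure information stored in the PDF.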

My code is extremely modular, neat & efficient, and includes an OO API. So it 
should be easily extendable with author, title, publisher, year and section 
title extraction capabilities.

_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
