Thanks for your sound explanation, Ross; it is the framework from which I have defined this problem.
On Fri, Nov 11, 2011 at 10:21 AM, Ross Moore <[email protected]> wrote:
> Hi Leonard, Josh and Albert,
>
> On 11/11/2011, at 9:42 AM, Leonard Rosenthol wrote:
>
>> Albert was looking in the wrong place :).
>>
>> Check for either the MarkInfo and/or StructTreeRoot key in the Catalog.
>> Logical Structure was introduced in PDF 1.3 and Tagged PDF in 1.4 – so these
>> features aren't all that new.
>
> It is true that these are not new.
> It is also true, unfortunately, that many PDF-producing software
> applications either:
>
>  1. cannot embed this kind of information;
> or
>  2. can do some of it, but not all, and may not
>     do it automatically for all documents;
> or
>  3. their users do not know how to do what is required to
>     specify the appropriate Metadata and/or structure;
> or
>  4. maybe they do know how to, but could not be bothered
>     to actually do so.
>
> Without proper training on what is the purpose of metadata,
> and why encoding document structure is important or useful,
> then this situation is not going to change much.
>
>>
>> They are generated by numerous PDF producers including (but not limited to)
>> Adobe Acrobat, MS Office 2007 and later, OpenOffice, pdfTeX, etc. These
>> features are required in various international standards such as PDF/A-1a
>> and PDF/A-2a as well as the new PDF/UA.
>
> When one Prints a document to PDF (e.g. in Mac OS X) then a box comes up
> allowing Metadata such as Title, Author, Subject, Keywords to be included.
> But how many of your colleagues do you know who actually do anything but
> accept the default strings?
> For Title, the default is just the file name, without the '.' extension.
> How useful is that? It adds nothing to what is know from the file name itself.
>
> I'd expect the applications you list to be similar, but providing a sensible
> title, but *only* if the author has done the right thing within the Word
> Processing application to declare a piece of text as being *the* title.
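Leonard's check for the MarkInfo and StructTreeRoot keys, quoted above, can be roughly approximated without any PDF library. The following is only a sketch in Python: the function names are made up for illustration, and it scans the raw bytes of the file rather than parsing it. That means it does not confirm the keys actually sit in the Catalog, and it will miss files whose Catalog is stored inside a compressed object stream (possible from PDF 1.5 onwards). A real PDF parser would be needed for a reliable answer.

```python
import re

# PDF name tokens to look for. A name ends at whitespace or a delimiter,
# so the lookahead avoids matching longer names that merely share a prefix.
_KEY_PATTERNS = {
    "MarkInfo": re.compile(rb"/MarkInfo(?=[\s/<\[(]|$)"),
    "StructTreeRoot": re.compile(rb"/StructTreeRoot(?=[\s/<\[(]|$)"),
}


def find_structure_keys(pdf_bytes):
    """Return the set of structure-related key names seen in the raw bytes."""
    return {name for name, pattern in _KEY_PATTERNS.items()
            if pattern.search(pdf_bytes)}


def looks_tagged(pdf_bytes):
    """Crude yes/no: does the file mention either key anywhere?"""
    return bool(find_structure_keys(pdf_bytes))
```

For a batch of files, something like `looks_tagged(open(path, "rb").read())` per file would give a first, rough count of how many mention either key.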
>
>>
>> I wish they all used it too…Unfortunately, many less capable PDF producers
>> don't support it.
>
> And that is presumably where Alec's application comes in, for a bunch
> of PDFs that were created using software that doesn't provide
> adequate Metadata --- or the authors never bothered to use that feature.
>
> So the aim should be for his software to:
>
>  1. check whether a document title exists already,
>     in the DocInfo dictionary, say;
>
>   if not then
>
>  2. try to find an appropriate piece of text within the document
>     by applying some heuristics,
>
>  3. write this into (a new version of) the PDF, making sure to
>     put it into the correct data structure (i.e. dictionary).

My project deals with header/footer analysis, ToC analysis, and the imposition of a logical structure onto PDFs that delimits this information.

The due date for completion is the 24th of this month. By then I will have, at the very least, reliable and accurate header/footer extraction into an XML file. I have already built the entire middleware ([input pdf]->[output header/footer in XML]) and implemented the whole project as an OO API with proper manipulation where expected, plus a new xmltohf project for parallel processing. I will also include a paper outlining the methodology, the results, and a comparison with previous work.

Once I have released this project, any of you will be able to easily extend the API with other information, such as metadata relating to title, author, etc.

What I plan to do over the next 13 days (apart from studying for and sitting two examinations for unrelated tertiary studies, and completing the final work for a conference I'm running) is to improve the accuracy of my header/footer detection and push the information back into the PDF. I should also have time to separate it into a ToC and add it to the bookmark "field" of the PDF.

Any advice on how I can reverse-engineer the XML into the PDF would be very much appreciated. (i.e.
what are the poppler library entry points for inserting bookmarks, and imposing logical structures?)

Thanks for all suggestions,

Alec Taylor

> It should add other appropriate Metadata too, such as Modification
> date/time and whatever else in XMP is useful and appropriate.
> An RDF block of Metadata might be added as well, and perhaps
> even a Colour profile.
> I'm sure Leonard could suggest other things too.
>
> Adding the complete document structure tree is probably asking too
> much at this stage --- though that should be an ultimate aim.
> This can be a highly complex task, adding such functionality
> to existing PDF-producing software.
>
> To give an example of how I'm working on this very task for pdfTeX
> --- in particular adding tagging of mathematical content ---
> take a look at this video of a talk that I gave recently at
> the TUG 2011 conference:
>
> http://river-valley.tv/further-advances-toward-tagged-pdf-for-mathematics/
>
> This is ongoing work, and I'd appreciate your comments.
>
> All the best,
>
> Ross
>
>>
>> Leonard
>>
>> From: Josh Richardson <[email protected]>
>> Date: Thu, 10 Nov 2011 14:28:10 -0800
>> To: Leonard Rosenthol <[email protected]>, Alec Taylor
>> <[email protected]>
>> Cc: Albert Cid <[email protected]>, "[email protected]"
>> <[email protected]>, "[email protected]"
>> <[email protected]>
>> Subject: Re: [poppler] Extract title from pdf file.
>>
>> Leonard, I don't understand. You say Alec is "missing HUGE PIECES of
>> functionality found in the majority of real-world documents", but Albert
>> says he has 1200 documents and none of them has markings. So, which is it,
>> or what is it that Alec's missing?
>>
>> I've got access to more than 10k PDFs, published in the past year or two,
>> which I'd be happy to check, if you can tell me how. I'd be curious to know
>> how many of them are taking advantage of these newer PDF features, and I'd
>> LOVE it if they all were. Sadly, my guess is that it's close to zero.
>> :-(
>>
>> --josh
>
> ------------------------------------------------------------------------
> Ross Moore                                       [email protected]
> Mathematics Department                           office: E7A-419
> Macquarie University                             tel: +61 (0)2 9850 8955
> Sydney, Australia 2109                           fax: +61 (0)2 9850 8114
> ------------------------------------------------------------------------
>
>
>
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
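The three steps Ross lists in the quoted text (check for an existing title, otherwise guess one with a heuristic, then write it into the right dictionary) could be sketched roughly as below. This is an illustration under stated assumptions, not anyone's actual implementation: a plain Python dict stands in for the DocInfo dictionary, the first page is assumed to be already extracted as (text, font size) pairs, the "largest font wins" rule is just one possible heuristic, and the function names are hypothetical. Reading and writing the PDF itself is left out. Treating a title equal to the file name as missing follows Ross's remark about the print-to-PDF default.

```python
def pick_title(docinfo, first_page_spans, filename_stem=""):
    """Return (title, source), trying the existing metadata first.

    docinfo: dict standing in for the PDF DocInfo dictionary (assumption).
    first_page_spans: list of (text, font_size) pairs from page one.
    filename_stem: file name without extension, used to spot default titles.
    """
    existing = (docinfo.get("Title") or "").strip()
    # Step 1: keep an existing title unless it is just the file name.
    if existing and existing != filename_stem:
        return existing, "docinfo"

    # Step 2: heuristic guess, here the text set in the largest font.
    candidates = [(size, text.strip())
                  for text, size in first_page_spans if text.strip()]
    if not candidates:
        return None, "none"
    largest = max(size for size, _ in candidates)
    title = " ".join(text for size, text in candidates if size == largest)
    return title, "heuristic"


def with_title(docinfo, title):
    """Step 3: return a copy of the dictionary with the title set."""
    updated = dict(docinfo)
    updated["Title"] = title
    return updated
```

Returning the source alongside the title makes it easy to report how many documents needed the heuristic, and returning a copy in step 3 leaves the original metadata untouched until the new PDF version is actually written.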
