From: Adrian Perez de Castro <[email protected]>

Hello to all!

We have been working at Igalia to improve the accessibility support in
Evince, and as part of this work we wanted Tagged-PDF support in poppler
(and poppler-glib). Apart from letting accessibility technologies know
the logical structure of documents and their layout attributes,
supporting this feature allows for a number of niceties, such as better
export to other formats and reflowing documents (for example on devices
with small screens).

The first two patches add the actual low-level support. Let me elaborate
on this part, as it is the biggest (and most complex) one. (Note:
"/SlashedPrefix" names refer to actual items in the PDF.)

* The structure tree is mostly read-only for the moment. There are some
  setters here and there, but no real support for creating a structure
  tree programmatically. There is no support for writing to file either.

* Catalog no longer uses an Object for the /StructTreeRoot, but an
  instance of StructTreeRoot.

* Using Catalog::getStructTreeRoot() or PDFDoc::getStructTreeRoot() will
  parse the structure tree when first accessed. The whole tree is created
  at once, the main work being done by StructTreeRoot::parse() and
  StructElement::parse().

* The Catalog owns the StructTreeRoot and its StructElements.

* StructTreeRoot keeps references to /ClassMap and /RoleMap, which may be
  needed when parsing /StructElem objects. This is why StructElement
  objects have a reference to their StructTreeRoot.

* std::vector<> is used for lists of child elements in StructTreeRoot
  and StructElement. IMHO, it is better to use "a bit more modern C++"
  for code which does not need to be merged/rebased with changes coming
  from new potential Xpdf releases.

* To extract the actual content referenced by an element, I have
  implemented an output device (MCOutputDev), which takes an MCID and
  records the painting operations in the page stream for that one MCID.
  Character drawing, changes in font faces, and font style
  (italics/bold/fixed-width) are recorded.
  Recording is done using MCOp structures (short for Marked Content
  Operation). Once a page (or range of pages) has been "displayed" using
  an MCOutputDev, the list of operations can be obtained with
  MCOutputDev::getMCOps(). As an example: to obtain only the text
  contents, iterate over the list and pick only the MCOps with
  type==mcOpUnichar, converting them into a string as you go. (There is
  a complete example of this in StructElement::getText().) Initially I
  tried subclassing TextOutputDev, but its innards are done in such a
  way that skipping the content and picking only the parts marked with a
  particular MCID would leave it in an inconsistent state -- making it
  segfault.

* Parsing tries to be tolerant and to continue reading as much
  information as possible before bailing out. Nevertheless, checks on
  the parsed data are done and warnings are printed using errSyntaxError
  in a number of places.

Known issues / TODOs:

* Lookups in /RoleMap should be recursive and able to detect loops. (I
  am working on this while waiting for feedback from the code review :D)

* Object References (/OBJR) are not handled. I have not seen PDFs using
  those with the references pointing to text. As the focus is improving
  accessibility, I left this unimplemented for the moment.

* Marked Content Reference objects (/MCR) can contain a reference to the
  exact stream which has the actual content. Those are ignored, as
  having the page reference (/Pg) and /MCID is enough. Also, it looked
  to me like it would be a bit cumbersome to interpret a single stream
  with the existing poppler APIs (suggestions and hints are welcome!).

* Attribute inheritance is not handled very well when the /Placement of
  an element is specified and it is other than the default (e.g. if an
  inline element like /Span has set /Placement/Block).

Other / Misc:

* There is initial poppler-glib support, which exposes only a subset of
  the low-level functionality.
  I will be updating this in the next few days, but I wanted to include
  it to get some feedback about the API.

* Bonus: there is a patch to add a new pane in the poppler-glib demo
  with the document structure. It is a bit crude, but serves as a usage
  example of the API.

* Bonus (x2): There is a patch for pdfinfo which will make it print the
  document structure when invoked as "pdfinfo -struct" or
  "pdfinfo -struct-text" (the latter including each element's text).
  Very useful for debugging.

* Bonus (x3): I have cleaned up some test code and used it to make a
  "pdfstructtohtml" utility. It is very simplistic for the moment, yet
  the resulting HTML it produces is quite clean and neat for PDF files
  without an overly complex

That is all for now. I will also be attaching the patches to the
relevant bugs (which I created some days ago, as dependencies on a
meta-bug [1] tracking all the Tagged-PDF parts). All the
feedback/critique you can provide will be handy :-)

Best regards,
-Adrian

---
[1] https://bugs.freedesktop.org/show_bug.cgi?id=tagged-pdf

----

Adrian Perez de Castro (6):
  Tagged-PDF: Accessors in Catalog for the MarkInfo dictionary
  Tagged-PDF: Interpret the document structure
  Tagged-PDF: Modify pdfinfo to show the document structure
  Tagged-PDF: Implement the utils/pdfstructtohtml tool
  Tagged-PDF: Expose the structure tree in poppler-glib
  Tagged-PDF: Pane in poppler-glib demo showing the structure

 glib/Makefile.am                    |    4 +
 glib/demo/Makefile.am               |    2 +
 glib/demo/main.c                    |    2 +
 glib/demo/taggedstruct.c            |  230 ++++++
 glib/demo/taggedstruct.h            |   31 +
 glib/poppler-document.cc            |   22 +
 glib/poppler-document.h             |    1 +
 glib/poppler-private.h              |   24 +
 glib/poppler-structure-element.cc   | 1289 +++++++++++++++++++++++++++++++++
 glib/poppler-structure-element.h    |  346 +++++++++
 glib/poppler-structure.cc           |  349 +++++++++
 glib/poppler-structure.h            |   43 ++
 glib/poppler.h                      |    3 +
 glib/reference/poppler-docs.sgml    |    2 +
 glib/reference/poppler-sections.txt |   86 +++
 glib/reference/poppler.types        |    2 +
 poppler/Catalog.cc                  |   81 ++-
 poppler/Catalog.h                   |   15 +-
 poppler/MCOutputDev.cc              |  145 ++++
 poppler/MCOutputDev.h               |  108 +++
 poppler/Makefile.am                 |    6 +
 poppler/PDFDoc.h                    |    3 +-
 poppler/StructElement.cc            | 1361 +++++++++++++++++++++++++++++++++++
 poppler/StructElement.h             |  273 +++++++
 poppler/StructTreeRoot.cc           |  120 +++
 poppler/StructTreeRoot.h            |   56 ++
 utils/Makefile.am                   |    5 +
 utils/pdfinfo.cc                    |   97 ++-
 utils/pdfstructtohtml.cc            |  387 ++++++++++
 29 files changed, 5074 insertions(+), 19 deletions(-)
 create mode 100644 glib/demo/taggedstruct.c
 create mode 100644 glib/demo/taggedstruct.h
 create mode 100644 glib/poppler-structure-element.cc
 create mode 100644 glib/poppler-structure-element.h
 create mode 100644 glib/poppler-structure.cc
 create mode 100644 glib/poppler-structure.h
 create mode 100644 poppler/MCOutputDev.cc
 create mode 100644 poppler/MCOutputDev.h
 create mode 100644 poppler/StructElement.cc
 create mode 100644 poppler/StructElement.h
 create mode 100644 poppler/StructTreeRoot.cc
 create mode 100644 poppler/StructTreeRoot.h
 create mode 100644 utils/pdfstructtohtml.cc

--
1.8.3

_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
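
Appendix, sketch 1: picking the text out of a list of MCOps. The cover
letter describes iterating over the recorded operations and keeping only
those with type==mcOpUnichar. The snippet below is a minimal,
self-contained illustration of that idea and is not the code from the
patches: the MCOp layout, the mcOpFontName and mcOpFlags tags, and the
helpers appendUtf8(), textFromOps() and textExample() are all
hypothetical stand-ins (only the mcOpUnichar name appears in the text
above), and the UTF-8 encoder does no validation of code points.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the MCOp structure; the real one in the
// patches may look different. Only the type tag and code point matter.
enum MCOpType { mcOpUnichar, mcOpFontName, mcOpFlags };

struct MCOp {
    MCOpType type;
    std::uint32_t unichar;  // meaningful when type == mcOpUnichar
    std::string fontName;   // meaningful when type == mcOpFontName
    unsigned flags;         // meaningful when type == mcOpFlags
};

// Append one Unicode code point to `out` as UTF-8 (no validation).
static void appendUtf8(std::string &out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Keep only the character operations and join them into one string;
// font and style operations are skipped.
std::string textFromOps(const std::vector<MCOp> &ops)
{
    std::string text;
    for (const MCOp &op : ops) {
        if (op.type == mcOpUnichar)
            appendUtf8(text, op.unichar);
    }
    return text;
}

// Usage: font and flag operations interleaved with two characters.
void textExample()
{
    std::vector<MCOp> ops = {
        { mcOpFontName, 0, "SomeFont", 0 },
        { mcOpUnichar, 'H', "", 0 },
        { mcOpFlags, 0, "", 1 },
        { mcOpUnichar, 'i', "", 0 },
    };
    assert(textFromOps(ops) == "Hi");
}
```

A consumer interested in styled output (for example an HTML exporter)
would presumably handle the font and flag operations in the same loop
instead of skipping them.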
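
Appendix, sketch 2: /RoleMap lookups with loop detection. The TODO list
above says lookups in /RoleMap should follow chains of mappings and
detect loops. One possible shape for that, as a self-contained sketch
under stated assumptions: the role map is modelled as a plain
std::map of names (the real one is a PDF dictionary), the set of
standard types is passed in by the caller, and resolveRole() and
roleMapExample() are hypothetical names, not functions from the
patches. The chain is walked iteratively, which is equivalent to the
recursive lookup but avoids deep call stacks on long chains.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the /RoleMap dictionary: custom name -> name.
typedef std::map<std::string, std::string> RoleMap;

// Follow mappings starting at `name` until a standard type is reached.
// Returns an empty string when the chain ends at an unmapped name or
// revisits a name (a loop).
std::string resolveRole(const RoleMap &roleMap,
                        const std::set<std::string> &standardTypes,
                        const std::string &name)
{
    std::set<std::string> visited;
    std::string current = name;
    while (standardTypes.count(current) == 0) {
        if (!visited.insert(current).second)
            return std::string();  // already seen: the mapping loops
        RoleMap::const_iterator it = roleMap.find(current);
        if (it == roleMap.end())
            return std::string();  // no mapping for a non-standard name
        current = it->second;
    }
    return current;
}

// Usage: a two-step chain, a loop, and an unmapped name.
void roleMapExample()
{
    std::set<std::string> standard = { "Sect", "P", "Span" };
    RoleMap roleMap = {
        { "Chapter", "MySection" },
        { "MySection", "Sect" },
        { "A", "B" },
        { "B", "A" },
    };
    assert(resolveRole(roleMap, standard, "Chapter") == "Sect");
    assert(resolveRole(roleMap, standard, "P") == "P");
    assert(resolveRole(roleMap, standard, "A").empty());
    assert(resolveRole(roleMap, standard, "Unknown").empty());
}
```

Returning an empty string for both failure cases keeps the sketch short;
real code would likely also emit a warning so that broken role maps
show up during parsing.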
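
Appendix, sketch 3: parse-on-first-access and ownership. The cover
letter says the structure tree is parsed the first time
getStructTreeRoot() is called, is built all at once, and is owned by the
Catalog. The snippet below illustrates only that pattern. The classes
are heavily simplified, hypothetical stand-ins that share names with the
ones in the patches but none of their real members; std::unique_ptr is
used here for brevity and is an assumption, not a statement about how
the patches manage memory. The parseCount member and the fake
parseStructTree() exist only so the behaviour can be observed.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Simplified stand-in: an element has a type name and owns its children.
struct StructElement {
    std::string type;
    std::vector<std::unique_ptr<StructElement>> children;
};

// Simplified stand-in: the root owns the top-level elements.
struct StructTreeRoot {
    std::vector<std::unique_ptr<StructElement>> children;
};

class Catalog {
public:
    // Builds the whole tree on the first call; later calls return the
    // same cached tree. The Catalog keeps ownership.
    StructTreeRoot *getStructTreeRoot()
    {
        if (!structTreeRoot) {
            structTreeRoot = parseStructTree();
            ++parseCount;
        }
        return structTreeRoot.get();
    }

    int parseCount = 0;  // illustration only: how often parsing ran

private:
    // Fake "parser": builds a Document element containing one P element.
    static std::unique_ptr<StructTreeRoot> parseStructTree()
    {
        std::unique_ptr<StructTreeRoot> root(new StructTreeRoot());
        std::unique_ptr<StructElement> doc(new StructElement());
        doc->type = "Document";
        std::unique_ptr<StructElement> para(new StructElement());
        para->type = "P";
        doc->children.push_back(std::move(para));
        root->children.push_back(std::move(doc));
        return root;
    }

    std::unique_ptr<StructTreeRoot> structTreeRoot;
};

// Usage: the second access does not trigger another parse.
void lazyParseExample()
{
    Catalog catalog;
    assert(catalog.parseCount == 0);
    StructTreeRoot *first = catalog.getStructTreeRoot();
    StructTreeRoot *second = catalog.getStructTreeRoot();
    assert(first == second);
    assert(catalog.parseCount == 1);
}
```

Callers receive a non-owning pointer, so the tree stays valid only as
long as the Catalog that produced it.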
