Summary: [PATCH] Improve generation of PDFs with accessibility
The current accessibility implementation does not scale and is slow. The
following implementation is memory efficient (only the structure information
of pages currently being processed is kept in memory) and faster.
[PATCH 01/20] container element for structure tree
Adds an object for storing the structure tree.
[PATCH 02/20] Add methods for storing structure tree to area tree
[PATCH 03/20] Helper functions for StructureElement construction
[PATCH 04/20] Parse role attributes
[PATCH 05/20] Implement mapping functions of FO objects to structure tree
[PATCH 06/20] Store new structure tree in area tree
[PATCH 07/20] Handle structure tree only in new format
[PATCH 08/20] Remove unused code
These patches switch to the new internal format for the structure tree. Only
the structure tree of unfinished page sequences is kept in memory, so the
overhead is limited for documents whose page sequences are not too large. The
content of the structure-tree tag in area-tree / intermediate XML is slightly
different. The page-sequence shows up as a tag, so that it is possible to give
it a role [e.g. Part does not make sense for a one-page document].
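To illustrate the memory behaviour described above, here is a minimal sketch
of a per-page-sequence lookup table that is dropped as soon as the page
sequence is finished. The class and method names are hypothetical and are not
taken from the patches; the real structure elements are reduced to plain tag
names.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: structure lookup that lives only for one page sequence. */
final class PageSequenceStructure {

    // Maps a ptr value from the area tree to a structure tag name.
    private final Map<String, String> ptrToTag = new HashMap<>();

    /** Remembers the structure tag that a ptr value refers to. */
    void register(String ptr, String tag) {
        ptrToTag.put(ptr, tag);
    }

    /** Returns the tag for a ptr value, or null if it is unknown or already freed. */
    String lookup(String ptr) {
        return ptrToTag.get(ptr);
    }

    /** Number of entries currently held in memory. */
    int liveEntries() {
        return ptrToTag.size();
    }

    /** Called when a page sequence is finished: nothing of it stays in memory. */
    void endPageSequence() {
        ptrToTag.clear();
    }
}
```

With this shape, the memory held at any time is bounded by the largest page
sequence rather than by the whole document.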
[PATCH 09/20] Workaround: Some test cases don't like ptr
I don't recommend applying this patch, as the additional ptr attributes do no
harm. Without it, however, someone has to do some work on the test suite.
[PATCH 10/20] Avoid overhead of creating writers
Adding structure information to a 100 MB PDF can mean adding 300-400 MB of
structure information. Several million additional dictionaries are written,
and just as many BufferedWriters are created. Getting rid of both stream
representations improves performance for large documents and simplifies the
code, as there is no longer a Writer that someone can forget to flush.
This patch is not required for the following patches.
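The idea can be sketched roughly as follows. This is not the code from the
patch: the class DirectDictionaryWriter and its write method are made up for
illustration, and dictionary values are simplified to preformatted strings.
The point is only that one text buffer is reused and its bytes go straight to
the OutputStream, so no Writer is created per object.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical sketch: write dictionaries without a Writer per object. */
final class DirectDictionaryWriter {

    // One buffer reused for every dictionary instead of a new BufferedWriter.
    private final StringBuilder buffer = new StringBuilder(256);

    /** Writes "<< /Key value ... >>" to the stream and returns the byte count. */
    int write(Map<String, String> entries, OutputStream out) {
        buffer.setLength(0); // reset the shared buffer, keep its capacity
        buffer.append("<<");
        for (Map.Entry<String, String> e : entries.entrySet()) {
            buffer.append(" /").append(e.getKey()).append(' ').append(e.getValue());
        }
        buffer.append(" >>\n");
        byte[] bytes = buffer.toString().getBytes(StandardCharsets.ISO_8859_1);
        try {
            out.write(bytes); // bytes go directly to the stream, nothing to flush
        } catch (IOException e) {
            // Wrapped only to keep the sketch short; real code would declare it.
            throw new UncheckedIOException(e);
        }
        return bytes.length;
    }
}
```

Because there is no intermediate Writer, there is also no buffered state that
could be lost by a missing flush.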
[PATCH 11/20] Add support for clearing objects at write time
[PATCH 12/20] Add support for lazy object number assignment
[PATCH 13/20] Improve PDFArray/Dictionary
[PATCH 14/20] Free structure tree ID map at the end of the page sequence
[PATCH 15/20] Don't write empty leaf structure to PDF
Structure information must be written while the document is still being
processed; freeing resources after writing reduces memory pressure. Assigning
object numbers later makes it possible to prune empty structure elements.
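A minimal sketch of why late numbering helps, assuming a simplified node type:
because no object number is handed out at construction time, empty elements
can be dropped without leaving gaps in the numbering. The names LazyNumbering,
Node and pruneAndNumber are hypothetical, and the sketch also removes parents
that become empty, which may go further than the patch does.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: number structure elements only after pruning. */
final class LazyNumbering {

    static final class Node {
        final String tag;
        final boolean hasContent;      // true if marked content refers to this node
        final List<Node> kids = new ArrayList<>();
        int objectNumber = 0;          // 0 means "not assigned yet"

        Node(String tag, boolean hasContent) {
            this.tag = tag;
            this.hasContent = hasContent;
        }

        Node add(Node kid) {
            kids.add(kid);
            return this;
        }
    }

    private int nextNumber = 1;

    /**
     * Drops empty subtrees, then numbers the survivors (children first).
     * Returns false if the node itself turned out to be empty.
     */
    boolean pruneAndNumber(Node node) {
        node.kids.removeIf(kid -> !pruneAndNumber(kid));
        if (!node.hasContent && node.kids.isEmpty()) {
            return false;              // never numbered, never written
        }
        node.objectNumber = nextNumber++;
        return true;
    }
}
```

Had the numbers been assigned when the nodes were created, the pruned elements
would already occupy object numbers that other objects may reference.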
[PATCH 16/20] Simplify references to the text
Use a more compact reference representation, if possible
[PATCH 17/20] Duplicate static content structures for each page
Static regions that are placed on multiple pages generate strange results.
Duplicating them in the structure tree yields more logical results.
[PATCH 18/20] Generate shorter PDF documents
Structure information means much larger PDFs, so some bytes can be saved.
[PATCH 19/20] New roles for accessibility
Keep the whole subtree out of the structure tree.
Hide this node in the structure tree and place its content directly in the
parent.
Both functions are necessary to create "beautiful" structure trees. The role
names need to be discussed.
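The effect of the two functions can be sketched on a simplified tree. The mode
names OMIT_SUBTREE and HIDE_NODE below are placeholders chosen for this
example only (they are not the role names proposed in the patch), and
RoleFilter is a hypothetical class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the two structure-tree functions. */
final class RoleFilter {

    enum Mode { NORMAL, OMIT_SUBTREE, HIDE_NODE }

    static final class Node {
        final String tag;
        final Mode mode;
        final List<Node> kids = new ArrayList<>();

        Node(String tag, Mode mode, Node... kids) {
            this.tag = tag;
            this.mode = mode;
            this.kids.addAll(Arrays.asList(kids));
        }
    }

    /** Returns the nodes that take the place of 'node' in its parent. */
    static List<Node> apply(Node node) {
        List<Node> result = new ArrayList<>();
        if (node.mode == Mode.OMIT_SUBTREE) {
            return result;             // node and all descendants disappear
        }
        List<Node> newKids = new ArrayList<>();
        for (Node kid : node.kids) {
            newKids.addAll(apply(kid));
        }
        if (node.mode == Mode.HIDE_NODE) {
            return newKids;            // children move up into the parent
        }
        Node copy = new Node(node.tag, Mode.NORMAL);
        copy.kids.addAll(newKids);
        result.add(copy);
        return result;
    }

    /** Renders a tree as "Tag(kid,kid)" for inspection. */
    static String render(Node node) {
        if (node.kids.isEmpty()) {
            return node.tag;
        }
        List<String> parts = new ArrayList<>();
        for (Node kid : node.kids) {
            parts.add(render(kid));
        }
        return node.tag + "(" + String.join(",", parts) + ")";
    }
}
```

For a Document containing a hidden Div (which holds a P and an omitted subtree)
followed by an H1, the filtered tree is Document(P,H1): the Div is gone, its P
has moved up, and the omitted subtree is dropped entirely.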
[PATCH 20/20] Support role for Flow & PageSequence
Allow changing the appearance of Flow/PageSequence in the structure tree.
Especially on small documents, the Structure Doc->Part->Sect can not be
For PageSequence, only the tag name can be assigned; the element itself
cannot be removed (because of implementation/performance issues).
Flow/static-content can be changed in every way.