On 23.10.2009 13:14:36 Vincent Hennebert wrote:
> Hi,
>
> Just a few precisions:
>
> Jeremias Maerki wrote:
> > On 22.10.2009 21:15:40 Simon Pepping wrote:
> <snip/>
> >> Can you summarize what the branch tries to achieve?
> >
> > I'll try. In short: it provides the Tagged PDF feature that some people
> > have always wanted.
> >
> > Long story: Without the accessibility/document structure feature, FOP
> > simply produces pages with visual content. Visually impaired people need
> > tools like a screen reader to read document to them. For that the reader
> > needs to know which parts of a page are important and which are not, and
> > in which order the elements should be read. It needs to know that a
> > sentence continues on the next page without stumbling over the page
> > footer in the middle of the sentence.
>
> This is something that the branch doesn’t actually do yet... The
> header/footer will be read at every new page, in the middle of the
> sentence.
> I don’t know yet how to fix that, and I’m not sure if that should be
> done blindly anyway. It could be imagined that in some elaborate layouts
> the side-regions have content that the author wants to be read aloud.
Actually, I believe we already do this quite nicely, but there is a bug
in Acrobat's screen reader: it doesn't fully rely on the document
structure information and instead reads through the tag order on each
page, which is not what I would expect.

I was just thinking: if PDFBox could be taught to interpret the document
structure information and feed the content to FreeTTS, you'd have a nice
open source PDF reader.

> <snip/>
> > There's another side-effect to tagged PDF: It allows for better text
> > extraction from the document. PDF even describes ways to make
> > round-trips from XML -> PDF -> XML -> PDF if certain conditions were met.
> > However, we don't do that.
>
> Speaking of that, the current code doesn’t insert empty elements (like
> <fo:block/>) into the structure tree. The corresponding StructElem
> object /is/ created, but is not linked to its parent. Actually it’s
> present in the PDF without being referred to by any other object.
> I think this is inconsistent, and actually wrong since that would cause
> a loss of information possibly needed by a round-trip transformation.
> I’m going to change that.

Good catch.

<snip/>

Jeremias Maerki