On 23.10.2009 13:14:36 Vincent Hennebert wrote:
> Hi,
> Just a few clarifications:
> Jeremias Maerki wrote:
> > On 22.10.2009 21:15:40 Simon Pepping wrote:
> <snip/>
> >> Can you summarize what the branch tries to achieve?
> > 
> > I'll try. In short: it provides the Tagged PDF feature that some people
> > have always wanted.
> > 
> > Long story: Without the accessibility/document structure feature, FOP
> > simply produces pages with visual content. Visually impaired people need
> > tools like a screen reader to read documents to them. For that, the reader
> > needs to know which parts of a page are important and which are not, and
> > in which order the elements should be read. It needs to know that a
> > sentence continues on the next page without stumbling over the page
> > footer in the middle of the sentence.
> This is something that the branch doesn’t actually do yet... The
> header/footer will be read at every new page, in the middle of the
> sentence.
> I don’t know yet how to fix that, and I’m not sure if that should be
> done blindly anyway. It could be imagined that in some elaborate layouts
> the side-regions have content that the author wants to be read aloud.

Actually, I believe we already handle this quite nicely, but there is a
bug in Acrobat's screen reader: it doesn't fully rely on the document
structure information but rather reads through the tag order on each
page, which is not what I would expect.

I was just thinking: if PDFBox could be taught to interpret the document
structure information and feed the content to FreeTTS, you'd have a nice
open source PDF reader.

> <snip/>
> > There's another side-effect to tagged PDF: It allows for better text
> > extraction from the document. PDF even describes ways to make
> > round-trips from XML -> PDF -> XML -> PDF if certain conditions are met.
> > However, we don't do that.
> Speaking of that, the current code doesn’t insert empty elements (like
> <fo:block/>) into the structure tree. The corresponding StructElem
> object /is/ created, but is not linked to its parent. Actually it’s
> present in the PDF without being referred to by any other object.
> I think this is inconsistent, and actually wrong since that would cause
> a loss of information possibly needed by a round-trip transformation.
> I’m going to change that.

Good catch.
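
To illustrate the point about empty elements, here is a minimal sketch of
the intended behaviour. The class and method names are hypothetical and
are not FOP's actual code; the only assumption is that every structure
element, including one created for an empty <fo:block/>, gets linked to
its parent so that it stays reachable from the structure tree root.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a PDF structure element; NOT FOP's actual class.
class StructElemSketch {
    final String role;                 // structure type, e.g. "Document" or "P"
    StructElemSketch parent;           // null for the root or for an orphaned element
    final List<StructElemSketch> kids = new ArrayList<>();

    StructElemSketch(String role) {
        this.role = role;
    }

    // Links the child unconditionally, whether or not it has any content.
    StructElemSketch addKid(StructElemSketch kid) {
        kid.parent = this;
        kids.add(kid);
        return kid;
    }

    // Counts the elements reachable from this one, including itself.
    int countReachable() {
        int n = 1;
        for (StructElemSketch kid : kids) {
            n += kid.countReachable();
        }
        return n;
    }

    public static void main(String[] args) {
        StructElemSketch root = new StructElemSketch("Document");
        root.addKid(new StructElemSketch("P"));                          // block with text
        StructElemSketch empty = root.addKid(new StructElemSketch("P")); // empty <fo:block/>
        System.out.println(root.countReachable());  // prints 3
        System.out.println(empty.parent == root);   // prints true
    }
}
```

An element that is created but never passed to addKid() would keep a
null parent and would not be counted from the root, which corresponds to
the orphaned object described above.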


Jeremias Maerki
