Stefano Mazzocchi wrote:
> What about using XSL:FO? Would be pretty cool to have the ability to
> transform PDF into FO, basically reversing what FOP does. I know it
> would be pretty painful to make it work with all kinds of PDF, but for
> reasonable ones it shouldn't be that hard (PDF is sort of a markup
> language already).

It would be cool, but sadly I think the PDF format usually throws away too much information: there's no concept of a "flow" of text, or even of a paragraph! I think SVG (or a subset, of course) would be a better match than FO. "Tagged" PDF carries more information, but most PDF files have a much simpler structure: disconnected lines of text, each positioned at a particular location on a page. I think the DTD I quoted actually covers most of what you could extract from most PDF files. :-(
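
To make the "disconnected lines at fixed positions" point concrete, here is a rough Python sketch of why SVG is the easier target. It assumes some extractor has already produced (x, y, text) tuples in PDF points; the function name, the tuple layout and the US Letter default page size are all made up for illustration and are not taken from the DTD or from any existing tool.

```python
from xml.sax.saxutils import escape


def fragments_to_svg(fragments, width=612, height=792):
    """Map positioned text fragments onto SVG <text> elements.

    fragments: iterable of (x, y, text) in PDF points, with the origin
    at the bottom-left of the page (the default PDF user space).
    width/height: page size in points; 612x792 (US Letter) is only an
    assumed default.
    """
    out = ['<svg xmlns="http://www.w3.org/2000/svg" '
           'width="%g" height="%g">' % (width, height)]
    for x, y, text in fragments:
        # SVG puts the origin at the top-left, so flip the y axis.
        svg_y = height - y
        out.append('  <text x="%g" y="%g">%s</text>'
                   % (x, svg_y, escape(text)))
    out.append('</svg>')
    return "\n".join(out)


if __name__ == "__main__":
    print(fragments_to_svg([
        (72, 720, "First line of a paragraph that PDF"),
        (72, 706, "never told us was a paragraph."),
    ]))
```

Each fragment maps one-to-one onto an absolutely positioned element, which is about all the structure there is. Producing FO would mean first guessing which lines belong to the same paragraph or flow, and that is exactly the information that is usually gone.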
