Re: [poppler] pdf to xml update

Jauco Noordzij Mon, 09 Oct 2006 01:12:35 -0700

Hi Jauco,

Sorry for the late reply, I just read through your mail and your blog
and it looks like you've done some great work. From the screenshots
your text flow analysis looks really good, and my first thought was
that this will probably also be useful for text selection in pdf
viewers. The text flow analysis in TextOutputDev.cc is easily
confused, which leads to weird behavior during selection, where the
selection will jump around and suddenly include unrelated blocks of
text from across the page
( https://bugs.freedesktop.org/show_bug.cgi?id=4006).

Hehehe, I'm not the fastest replier myself... sorry about that, I had a few weeks of extreme busyness combined with extreme tiredness/laziness. But I'm ready to get rocking again :)

I'm thinking that your text flow analysis is a bit more robust and if
we could use this as the basis for text selection too, we'd have a
much better story there.  I don't know how much time you have to work
on this now, but if you could split the text flow analysis from the
AbiWord XML output, that would be great.  Ideally, we keep the flow
analysis in poppler core ( i.e. in the poppler/  dir) and refactor the
code to build up a data structure that represents the text flow
(basically, just like TextOutputDev.cc does it).  Then the AbiWord
output tool just traverses this data structure and outputs the XML
document.  That way the libxml dependency also moves to the AbiWord
tool instead of making libpoppler depend on it.  Once that's in place,
I'd like to revisit the poppler selection code and see if I can make
it use your text flow analysis.
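
A minimal sketch of the split described above, assuming a plain in-memory flow structure that a separate writer traverses. The class names (`Flow`, `Paragraph`) and the element names (`document`, `section`, `p`) are hypothetical; they are not poppler's types or the AbiWord schema, and Python's standard `xml.etree.ElementTree` stands in for libxml here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Paragraph:
    # Lines of text in reading order (hypothetical structure).
    lines: List[str]


@dataclass
class Flow:
    # A flow is an ordered run of paragraphs; it knows nothing about XML.
    paragraphs: List[Paragraph] = field(default_factory=list)


def flow_to_xml(flows: List[Flow]) -> str:
    """Traverse the flow structure and serialise it.

    Only this function depends on an XML library, so the analysis
    side stays free of that dependency.
    """
    root = ET.Element("document")  # placeholder element names
    for flow in flows:
        section = ET.SubElement(root, "section")
        for para in flow.paragraphs:
            p = ET.SubElement(section, "p")
            p.text = " ".join(para.lines)
    return ET.tostring(root, encoding="unicode")
```

A selection implementation could walk the same `Flow` objects without ever touching the writer.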

I'm OK with dropping the dependency, but my code works by constructing a tree based on x,y coordinates and then interpreting this tree as a reading-order list of paragraphs. The tree is constructed in such a way that flattening it yields the correct reading order. If you only want one long string of text in the correct order, this might be doable without constructing the tree. I would need to take a good look at how it is done now to be sure.
Without the tree, though, there will be no way to define paragraph endings and the other things I need for creating structured text. So that leaves two possibilities: writing code to maintain a tree with attributes inside poppler, or duplicating the code into the selection part and rewriting it there to create a flat list. I'm not a great fan of writing my own code to duplicate libxml functionality: I'd doubtlessly introduce new bugs, and I have to serialise to XML eventually anyway.
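
To illustrate the idea of a tree whose flattening is the reading order, here is a rough sketch using a recursive XY-cut on bounding boxes. This is a stand-in technique chosen for illustration, not necessarily how the actual code works; the `Box` and `Node` types, the `min_gap` threshold, and the assumption that y grows downward are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Box:
    x0: float
    y0: float  # top edge; y is assumed to grow downward
    x1: float
    y1: float
    text: str


@dataclass
class Node:
    children: List["Node"] = field(default_factory=list)
    box: Optional[Box] = None  # set on leaves only


def split_on_gap(boxes, lo, hi, min_gap):
    """Group boxes along one axis, cutting wherever a gap exceeds min_gap."""
    ordered = sorted(boxes, key=lo)
    groups, current, reach = [], [ordered[0]], hi(ordered[0])
    for b in ordered[1:]:
        if lo(b) - reach > min_gap:
            groups.append(current)
            current = []
        current.append(b)
        reach = max(reach, hi(b))
    groups.append(current)
    return groups


def build_tree(boxes, min_gap=10.0):
    """Build a tree whose depth-first traversal is the reading order."""
    if not boxes:
        return Node()
    if len(boxes) == 1:
        return Node(box=boxes[0])
    axes = (
        (lambda b: b.x0, lambda b: b.x1),  # try columns first
        (lambda b: b.y0, lambda b: b.y1),  # then rows
    )
    for lo, hi in axes:
        groups = split_on_gap(boxes, lo, hi, min_gap)
        if len(groups) > 1:
            return Node(children=[build_tree(g, min_gap) for g in groups])
    # No clear gap: treat the boxes as one paragraph-like group,
    # ordered top to bottom, then left to right.
    ordered = sorted(boxes, key=lambda b: (b.y0, b.x0))
    return Node(children=[Node(box=b) for b in ordered])


def flatten(node):
    """Depth-first traversal: returns the leaf texts in reading order."""
    if node.box is not None:
        return [node.box.text]
    out = []
    for child in node.children:
        out.extend(flatten(child))
    return out
```

For a page with a full-width title above two columns, `flatten(build_tree(boxes))` gives the title, then the left column, then the right column, regardless of the order the boxes arrive in. The inner nodes are what a flat list loses: each group of leaves marks where a paragraph-like block ends.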

Anyway, great that you like it! I'm finishing my internship ATM and doing some other assignments, but I'm determined to get this code fixed for inclusion into poppler. Getting the text selection fixed would be scratching a major itch as well. (I need to copy-paste from PDFs a _lot_ :) So let me know which direction you think is best for poppler as a whole.

--
greetings,
Jauco Noordzij

_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
