Jauco Noordzij <jauco <at> jauco.nl> writes:

> > Hi Jauco,
> >
> > Sorry for the late reply. I just read through your mail and your
> > blog, and it looks like you've done some great work. From the
> > screenshots your text flow analysis looks really good, and my first
> > thought was that this will probably also be useful for text
> > selection in PDF viewers. The text flow analysis in TextOutputDev.cc
> > is easily confused, which leads to weird behavior during selection,
> > where the selection will jump around and suddenly include unrelated
> > blocks of text from across the page
> > (https://bugs.freedesktop.org/show_bug.cgi?id=4006).
>
> Hehehe, I'm not the fastest replier myself... Sorry about that, I had
> a few weeks of extreme busyness combined with extreme
> tiredness/laziness. But I'm ready to get rocking again :)
>
> > I'm thinking that your text flow analysis is a bit more robust, and
> > if we could use this as the basis for text selection too, we'd have
> > a much better story there. I don't know how much time you have to
> > work on this now, but if you could split the text flow analysis from
> > the AbiWord XML output, that would be great. Ideally, we keep the
> > flow analysis in poppler core (i.e. in the poppler/ dir) and
> > refactor the code to build up a data structure that represents the
> > text flow (basically, just like TextOutputDev.cc does it). Then the
> > AbiWord output tool just traverses this data structure and outputs
> > the XML document. That way the libxml dependency also moves to the
> > AbiWord tool instead of making libpoppler depend on it. Once that's
> > in place, I'd like to revisit the poppler selection code and see if
> > I can make it use your text flow analysis.
>
> I'm OK with dropping the dependency, but: my code works by
> constructing a tree based on x,y coordinates and then interpreting
> this tree as a reading-order list of paragraphs. The construction of
> the tree is done in such a way that a flattened tree will be in
> correct reading order.
> If you only want a long string of text in correct order, this might
> be doable without constructing the tree. I would need to take a good
> look at how it is done now to be sure.
>
> Without the tree, though, there will be no way to define paragraph
> endings and the other things I need for the structured text creation.
> So that leaves two possibilities: writing code to maintain a tree
> with attributes inside poppler, or duplicating the code in the
> selection part and rewriting it there to create a flat list. I'm not
> a great fan of writing my own code to duplicate libxml functionality:
> I'll doubtless introduce new bugs, and I have to serialise to XML
> eventually anyway.
>
> Anyway, great that you like it! I'm finishing my internship at the
> moment and doing some other assignments, but I'm determined to get
> this code fixed for inclusion into poppler. Getting the text selection
> fixed would be scratching a major itch as well. (I need to copy-paste
> from PDFs a _lot_ :) So let me know which direction you think is best
> for poppler as a whole.
>
> --
> greetings, Jauco Noordzij
>
> _______________________________________________
> poppler mailing list
> poppler <at> lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/poppler
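
To make the tree idea in the quoted message concrete, here is a minimal
sketch of a tree whose depth-first flattening yields paragraphs in
reading order. It is only an illustration: FlowNode, flatten and
readingOrder are hypothetical names, not poppler classes or Jauco's
actual code, and the sketch assumes the children of each node were
already sorted into reading order when the tree was built from the x,y
coordinates.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical node type, not a poppler class. A leaf is one paragraph; an
// inner node is a page region whose children were put in reading order when
// the tree was built from the x,y coordinates.
struct FlowNode {
    std::string paragraph;           // text of a leaf; empty for inner nodes
    std::vector<FlowNode> children;  // empty for leaves
};

// Depth-first, first-child-first walk: appends every leaf paragraph to `out`.
void flatten(const FlowNode& node, std::vector<std::string>& out) {
    if (node.children.empty()) {
        out.push_back(node.paragraph);
        return;
    }
    assert(node.paragraph.empty());  // inner nodes carry no text of their own
    for (const FlowNode& child : node.children) {
        flatten(child, out);
    }
}

// Convenience wrapper returning the paragraphs in reading order.
std::vector<std::string> readingOrder(const FlowNode& root) {
    std::vector<std::string> out;
    flatten(root, out);
    return out;
}
```

For a page node whose children are a title, a left column and a right
column, readingOrder() returns the title first, then the left-column
paragraphs, then the right-column ones. The flat list is all that text
selection would need, while the tree itself still records where each
paragraph begins and ends, which is the information the structured
output relies on.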
Hi all,

I have the same problem: I need to separate the text from different
paragraphs. So far I have modified the code in GFX so that it creates a
paragraph from all the text (text is defined by the TJ and Tj commands)
between BT and ET. So far the code works very well; as I said before,
it is deterministic, so we can always be sure that we get the correct
paragraph.

BUT I have a problem: when the code uses Unicode characters, I cannot
read the text :-( I'm tracing the code to see if I can get access to
the Unicode characters as 8-bit values, but I'm having problems with
that part. If somebody can tell me whether there is a way to translate
the Unicode characters to plain English, I can share my code to extract
paragraphs.

Thanks
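
One possible way to get Unicode text into 8-bit form is to encode each
code point as UTF-8. The sketch below combines that with the BT/ET
grouping described above. It is an assumption-laden illustration, not
poppler's Gfx API: Op, TextOp, appendUtf8 and paragraphsFromOps are
hypothetical names, and the sketch assumes the character codes from
Tj/TJ have already been mapped to Unicode code points (in a real PDF
that mapping depends on the font's encoding or ToUnicode CMap).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Append one Unicode code point to `out` as UTF-8 (1 to 4 bytes).
void appendUtf8(char32_t cp, std::string& out) {
    assert(cp <= 0x10FFFF);  // outside this range is not a Unicode code point
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical, simplified view of a content stream (not poppler's Gfx API).
enum class Op { BeginText, ShowText, EndText };  // BT, Tj/TJ, ET

struct TextOp {
    Op op;
    std::u32string text;  // decoded code points; only used by ShowText
};

// Collect everything shown between a BT and its ET into one UTF-8 paragraph.
std::vector<std::string> paragraphsFromOps(const std::vector<TextOp>& ops) {
    std::vector<std::string> paragraphs;
    std::string current;
    bool inText = false;
    for (const TextOp& t : ops) {
        switch (t.op) {
        case Op::BeginText:
            inText = true;
            current.clear();
            break;
        case Op::ShowText:
            if (inText) {
                for (char32_t cp : t.text) appendUtf8(cp, current);
            }
            break;
        case Op::EndText:
            if (inText) paragraphs.push_back(current);
            inText = false;
            break;
        }
    }
    return paragraphs;
}
```

With this encoding, plain ASCII text stays byte-for-byte the same, and a
character such as U+00E9 becomes the two bytes 0xC3 0xA9, so the result
can be stored in an ordinary 8-bit string without losing information.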
