Excerpts from mpsuzuki's message of Tue Sep 07 08:42:31 +0200 2010:
> Hi,
>
> I want to ask some questions about the internal design of
> poppler-glib.
>
> ----------------------------------------------------------
>
> Recently Albert accepted my proposal to extend the interface
> of TextOutputDev to access the raw text (the layout/position
> info is not considered). At present, only poppler-qt4 can
> use the extended API, but I don't want to restrict it to
> poppler-qt4. I'm trying to extend poppler-glib (and poppler-cpp
> next) to use the extended API.
>
> Checking the internal code for how the text is extracted from PDF,
> there is a difference between poppler-qt4 and poppler-glib.
> Adding a few new APIs to enable/disable raw-order mode is
> insufficient for poppler-glib to access raw text.
>
> poppler-qt4
> -----------
> To get the text content from a page object, Poppler::Page::text()
> is invoked.
>
> In Poppler::Page::text(), a TextOutputDev is created,
> TextOutputDev::displayPageSlice() is invoked with the selection area,
> and TextOutputDev::getText() is invoked and a GooString is obtained.
> Finally, the GooString is converted to a QString object and returned
> to the client.
>
> poppler-glib
> ------------
> To get the text content from a page object,
> TextOutputDev::getSelectionText() is used.
>
> It dumps the strings collected by a TextSelectionVisitor
> object. TextSelectionVisitor defines 3 methods to eat the text,
> visitBlock(), visitLine() and visitWord(). But only the visitLine()
> method is implemented. Because a "line" is defined by the
> analysis of the text layout, there are no lines in raw order.
>
Why not simply use TextOutputDev::getText() like the qt4 frontend does?
TextOutputDev::getSelectionText() is meant for selections, but you
don't want text in raw order for selections. I would just add a new
method:

gchar *poppler_page_get_raw_text (PopplerPage *page);

Regards,
--
Carlos Garcia Campos
PGP key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x523E6462
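
P.S. To make the raw-order versus layout-order distinction concrete, here is a minimal sketch. It is a toy model, not poppler code: the Word struct and the rawText() and layoutText() helpers are hypothetical names invented for this illustration, and real layout analysis is far more involved than a sort on coordinates. It only shows why "lines" exist after layout analysis and not in raw order.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy model of a word on a page (not a poppler type).
struct Word {
    std::string text;
    double x;  // left edge of the word
    double y;  // baseline position, growing down the page
};

// Raw order: join the words exactly in the order they were drawn
// in the content stream. No notion of a line exists here.
std::string rawText(const std::vector<Word>& words) {
    std::string out;
    for (const Word& w : words) {
        if (!out.empty()) out += ' ';
        out += w.text;
    }
    return out;
}

// Layout order (very simplified): sort by baseline, then by x, and
// start a new line whenever the baseline changes.
std::string layoutText(std::vector<Word> words) {
    std::stable_sort(words.begin(), words.end(),
                     [](const Word& a, const Word& b) {
                         if (a.y != b.y) return a.y < b.y;
                         return a.x < b.x;
                     });
    std::string out;
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (i > 0) out += (words[i].y != words[i - 1].y) ? '\n' : ' ';
        out += words[i].text;
    }
    return out;
}
```

A selection needs the second form, because the user drags over what is visible on the page; a raw dump wants the first, which is why a separate entry point seems cleaner than teaching the selection code about raw order.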
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
