Hello,
On 28.06.2012 19:36, Albert Astals Cid wrote:
> On Thursday, 28 June 2012, at 18:54:40, Ihar `Philips` Filipau wrote:
>> On 6/28/12, Adam Reichold <[email protected]> wrote:
>>> If I remember correctly, some time ago someone proposed caching the
>>> TextOutputDev/TextPage used in Poppler::Page::search to improve
>>> performance. Instead, I would propose to add another search method
>>> to Poppler::Page which searches the whole page at once and returns
>>> a list of all occurrences. [>>snip<<] Testing this with some sample
>>> files shows large improvements (above 100% as measured by runtime)
>>> when searching the whole document, especially for short phrases
>>> that occur often.
>>>
>>> Thanks for any comments and advice. Best regards, Adam.
>>
>> That was me. Use case: I was checking the results of converting a
>> large PDF into an e-book.
>>
>> The PDF was a 600+ page book: 325K words in total, 20K unique.(*)
>> The problem was (and is) that there is no way to point at a piece of
>> text in a PDF - search was (and is) the only option. The conversion
>> produced around 200 warnings, and I had to check them all. Meaning:
>> 200 searches for a group of words in a 600+ page document. IIRC it
>> was taking 6-7 seconds per search in Okular (an up-to-date version
>> from Debian Sid). (Other PDF viewers didn't fare any better, but
>> multi-word search is unique to Okular and was the reason why I used
>> it exclusively.)
>>
>> Any speed-up would be extremely helpful. :)
>>
>> Though the most annoying part was not the waiting time - manually
>> checking 200+ warnings is never going to be fast - it was that my
>> CPU fan started spinning up loudly: those 6-7 seconds were seconds
>> when Okular was taking 100% CPU.
>>
>> (*) I have the parameters noted, since I was actually imagining more
>> of a per-word search index for a PDF. Now, looking at your patch, I
>> can even calculate the memory requirements. A global word index -
>> 325K words at, say, 32 wchar_t each, plus an int page and a
>> sizeof(rectf) per occurrence - comes to about 32 MB: not much by
>> modern standards. Per unique word it is even less: 20K unique words
>> at about 20 hits per word on average gives char word[32]; { int
>> page; rectf rect } x 20, i.e. 32*sizeof(wchar_t) + 20*(4 +
>> 4*sizeof(double)) = 784 bytes. Multiplied by 20K words, that is
>> about 16 MB. (Plus, of course, the memory allocation overhead,
>> which with these kinds of structures can already bite.)
>>
>> wbr.
>
> This won't help Okular at all.
>
> Cheers, Albert

I see. Would you consider including it (if deemed technically fit)
anyway?

Best regards, Adam.

_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
