Bug#1061423: xpdf: performance of a second text search should be improved

Vincent Lefevre Thu, 25 Jan 2024 01:48:12 -0800

Hi Adam,

On 2024-01-24 18:08:25 +0000, Adam Sampson wrote:
> On Wed, Jan 24, 2024 at 11:30:01AM +0100, Vincent Lefevre wrote:
> > With zathura, the first search needs the same time as xpdf, but a
> > second search is much faster (almost immediate). xpdf text search
> > should be improved to be as fast as zathura.
> 
> Looking at the code, Zathura uses poppler-glib, which implements search
> in the same way as xpopple -- it renders each page to a TextPage object
> and searches the text strings in that. However, poppler-glib caches the
> TextPages once they've been rendered, whereas xpopple renders them again
> for each new search.
> 
> I've just pushed a commit to xpopple git to add a similar cache, which
> seems to have the desired effect. It'll use a bit more memory this way
> but I don't expect it'll be a concern relative to the size of the rest
> of the PDF data...


Thanks. On PDF files with many pages (e.g., large books), the cache can
actually take much more memory, but I think that it is beneficial
in general: such files are precisely the ones where a text search
would take time without the cache.

For instance, for the POSIX spec (3952 pages), which is quite an
extreme case, the RSS memory increases from 55 MB to 3.4 GB after a
search across the full document. For the Debian handbook (498 pages),
which is a rather large book, the RSS memory increases from 37 MB to
430 MB (which is less than what Firefox takes on my machine).
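
To put these figures on a per-page basis, here is a small Python sketch.
It only redoes the arithmetic on the numbers quoted above; the function
and variable names are made up for illustration, and 3.4 GB is taken as
3400 MB.

```python
# Figures quoted above: (pages, RSS before search in MB, RSS after search in MB).
MEASUREMENTS = {
    "POSIX spec": (3952, 55, 3400),   # assumption: 3.4 GB counted as 3400 MB
    "Debian handbook": (498, 37, 430),
}


def cache_cost_per_page(pages, rss_before_mb, rss_after_mb):
    """Return the average RSS growth per page, in MB."""
    return (rss_after_mb - rss_before_mb) / pages


for name, (pages, before, after) in MEASUREMENTS.items():
    cost = cache_cost_per_page(pages, before, after)
    print(f"{name}: {cost:.2f} MB/page")
# prints:
# POSIX spec: 0.85 MB/page
# Debian handbook: 0.79 MB/page
```

Both documents come out at a bit under 1 MB of cached data per page, so
the growth looks roughly linear in the page count.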

I think that if memory is an issue for some users, the cache could be
made optional. Or perhaps poppler-glib could be improved to provide
a cache of just the text (for the POSIX spec, the full text itself
takes only about 10 MB).
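
As a rough illustration of both ideas (an optional cache, and a cache
that keeps only the text), here is a minimal Python sketch. It is not
how xpopple or poppler-glib are implemented: PageTextCache, find_pages
and the extract_text callback are hypothetical names, and the callback
stands in for whatever actually renders a page and returns its text.

```python
from collections import OrderedDict
from typing import Callable, List


class PageTextCache:
    """Hypothetical LRU cache of per-page plain text, bounded by UTF-8 size."""

    def __init__(self, extract_text: Callable[[int], str], max_bytes: int):
        self._extract_text = extract_text  # assumed to render the page and return its text
        self._max_bytes = max_bytes        # 0 disables caching entirely
        self._entries = OrderedDict()      # page number -> (text, size in bytes)
        self._used_bytes = 0

    def text(self, page: int) -> str:
        if page in self._entries:
            # Cache hit: mark the page as most recently used.
            self._entries.move_to_end(page)
            return self._entries[page][0]
        text = self._extract_text(page)
        size = len(text.encode("utf-8"))
        if self._max_bytes > 0 and size <= self._max_bytes:
            self._entries[page] = (text, size)
            self._used_bytes += size
            # Evict least recently used pages until the budget is respected.
            while self._used_bytes > self._max_bytes:
                _, (_, evicted_size) = self._entries.popitem(last=False)
                self._used_bytes -= evicted_size
        return text


def find_pages(cache: PageTextCache, num_pages: int, needle: str) -> List[int]:
    """Return the (0-based) numbers of the pages whose text contains needle."""
    return [p for p in range(num_pages) if needle in cache.text(p)]
```

With a budget of a few tens of MB, the whole text of a document such as
the POSIX spec would fit, and a second search would not need to render
any page again. A text-only cache like this one can presumably only tell
which pages match; to highlight a match, the page would still have to be
rendered again to get the positions.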

-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
