Re: [htdig] Improving performance with lots of rarely-changing pdf documents and almost as many new HTML documents

Geoff Hutchison Fri, 02 Feb 2001 15:54:21 -0800
On Thu, 1 Feb 2001, Jeff wrote:

> there are thousands of documents for it to handle. From what I can
> tell, the two most resource-intensive phases are:
> 
> parsing pdf documents while htdig is running
> 
> sorting the wordlist when htmerge is running

Yes. The latter will always be memory-intensive. That's why the 3.2 code
builds databases directly--no sorting.

> It looks as though I've either overlooked an important parameter
> somewhere, or found a rough spot in htdig itself. If htdig really IS
> re-parsing every pdf document every time it's run -- even when the old

I haven't a clue why it's being re-parsed, but my guess is that the server
isn't returning a Last-Modified header for comparison. I'd run htdig
-vvv and take a careful look at what happens when you index a PDF that's
already in the DB.
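
To make the comparison concrete, here is a minimal sketch of the kind of
check involved. It is not htdig's actual code: the function name
needs_refetch and the rule "no usable header means fetch again" are
assumptions made for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def needs_refetch(indexed_at: datetime, last_modified: Optional[str]) -> bool:
    """Return True if a document should be fetched and parsed again.

    indexed_at must be a timezone-aware datetime (when the stored copy was
    indexed); last_modified is the raw Last-Modified header value, or None.
    """
    if not last_modified:
        # No Last-Modified header: nothing to compare, so assume it changed.
        return True
    try:
        modified_at = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        # Unparseable date: treat the document as changed to be safe.
        return True
    if modified_at.tzinfo is None:
        # A header without zone information is interpreted as UTC here.
        modified_at = modified_at.replace(tzinfo=timezone.utc)
    return modified_at > indexed_at
```

If the server never sends the header, every PDF falls into the first
branch on every run, which would match the symptom described.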

> 2db.docdb.work, 2db.wordlist.work, etc.), then getting htmerge to
> combine them all into one big index.

This won't fix the second problem. Granted, I could have implemented a
better sort technique for merging databases (since the individual
wordlists should be pre-sorted), but it felt like patching a tire with a
broken spoke. It's much better to get 3.2 out the door with a better
overall approach.
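
For illustration only, merging inputs that are already sorted does not
need a full re-sort; a k-way merge is enough. The sketch below is not
what htmerge does, and the (word, document id) record layout is a
hypothetical stand-in for the real wordlist format.

```python
import heapq
from typing import Iterable, Iterator, Tuple

Entry = Tuple[str, int]  # (word, document id): a hypothetical record layout


def merge_sorted_wordlists(*wordlists: Iterable[Entry]) -> Iterator[Entry]:
    """Lazily merge already-sorted wordlists into one sorted stream.

    Each input must already be sorted. heapq.merge keeps only one pending
    entry per input in memory, so the cost is independent of total size.
    """
    return heapq.merge(*wordlists)
```

Because the result is an iterator, it can be written out entry by entry
without holding the combined list in memory.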

> Unfortunately, I've seen virtually nothing in the way of documentation
> anywhere on the site that tells how to go about doing this properly,
> besides listing the "-m" flag on the page for htmerge.

Suggestions on missing documentation, etc. are always welcome. We're not
documentation writers, so this would certainly be a much-appreciated
effort.

> - would I get better reindexing performance if I were to copy the
> all.* files rather than move them? The impression I've gotten from the
> site is that htmerge works from scratch every time it runs (hence the
> lack of its own -i option), so I might as well move them.

Copying is probably better. In your case, since you're not really using
the individual component databases, you can run htmerge with -d and -w
when merging and then run a final htmerge over the whole thing. Otherwise
you're doing a lot of sorting, creating db.words.db, etc.
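
As a rough sketch of that sequence, the helper below only builds the
command lines; it does not run anything. The config file names are
hypothetical, and pairing -c (main config) with -m (config of the
database being merged in) is my assumption about how the calls would be
laid out, so check it against the htmerge page before relying on it.

```python
from typing import List, Sequence


def merge_plan(main_conf: str, site_confs: Sequence[str]) -> List[List[str]]:
    """Build htmerge command lines: merge each site, then one final pass.

    The per-site merges pass -d and -w to skip the document and word
    steps; the last command runs htmerge over the combined result.
    """
    commands = [
        ["htmerge", "-d", "-w", "-c", main_conf, "-m", conf]
        for conf in site_confs
    ]
    # Final pass over the whole thing, without -d or -w.
    commands.append(["htmerge", "-c", main_conf])
    return commands
```

Each inner list can be handed to subprocess.run as-is once the layout
has been confirmed.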

> - Is htdig using its own sort program, or is there perhaps a better
> one to use in its place? Would specifically compiling a new copy of

It uses the system sort command. So if you have one that uses less memory
or is otherwise more efficient, it will be better.

> be done to keep them from trampling each other's temp files, or does
> htdig make sure it uses unique names for its tempfiles?

They are unique.

> - Is there anything in particular I should be aware of when merging
> LOTS of databases using htmerge? Perhaps small bits of corruption that
> are known to occur, that can snowball

There are two types of bugs that will receive immediate notice:
a) Data corruption issues
b) Core dumps

AFAIK there are *no* corruption issues in ht://Dig that we can reliably
trace. (I'd still love to work out these intermittent complaints, but they
seem to be hard to reproduce.)

> hundreds of conf files for each site, each one differing only with
> respect to these values)?

No, but you can use the include directive carefully: each config file
could be generated on the fly so that it includes the main config and
then overrides the database locations.
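
A minimal sketch of generating such a file, assuming the per-site
override is the database_dir attribute; the paths and the function name
render_site_config are hypothetical, and your setup may need to override
other database attributes instead.

```python
def render_site_config(main_conf: str, database_dir: str) -> str:
    """Return the text of a per-site config file.

    The file pulls in the shared main config first and then overrides
    only the database location, so later lines take effect for this site.
    """
    return (
        f"include: {main_conf}\n"
        f"database_dir: {database_dir}\n"
    )
```

The returned text can be written to a temporary file and passed to htdig
and htmerge with -c, so no per-site config needs to be kept on disk.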

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/



_______________________________________________
htdig-general mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/htdig-general