According to Geoff Silver:
> I've been using htdig for a little while, and I've recently been alerted
> to an indexing "issue", which I'm hoping someone might be able to help with.
> We have a list of about 800 sites we need to index.  If I run a small
> subset (10 or 20 sites), they index fine.  However, when I index the full
> 800, I find that htdig no longer stays on the site - that is, it seems to
> crawl off-site links as well (which is definitely a problem for us).
> 
> I have "limit_urls_to: ${start_url}" set in both my htdig.conf and a
> separate scitechdb.conf (science & technology database) file.  I'm
> actually using a multidig configuration (we index a few other small sites
> on the same server as different databases), which otherwise works well.
> 
> I'm wondering if there is an issue with indexing large amounts of data - a

That's weird.  We had problems like this in the older 3.2 betas, where the
limit_urls_to pattern got crammed into a very large regular expression,
which failed when the expression got too large.  The 3.1 code, on the
other hand, uses the StringMatch class to handle limit_urls_to, and I
don't know of any problems with really large patterns in StringMatch.
Indeed, it's supposed to allocate a pattern table big enough to handle
the worst case scenario for the size of string it's given.  Still,
I suppose it's not impossible that it chokes on really big patterns.
Can you find out what the breaking point is, after which it stops limiting
htdig to the list of URLs you want?
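To narrow down the breaking point without re-indexing all 800 sites at every step, you could bisect the length of the start-URL list. Here's a rough sketch, assuming your sites live one-per-line in a file (call it sites.txt); the fails_at function is a hypothetical stand-in for "index the first N sites and check the dig output for any off-site URL", which you'd replace with a real check against your logs:

```shell
#!/bin/sh
# Binary-search for the smallest number of start URLs at which htdig
# starts crawling off-site.  fails_at N is a placeholder predicate:
# here it pretends things break past 500 sites, purely for illustration.
fails_at() {
    # Replace with: head -n "$1" sites.txt > subset.txt; rebuild the
    # database from subset.txt; grep the verbose dig log for any URL
    # not matched by the subset.  Return 0 (true) if off-site URLs show up.
    [ "$1" -gt 500 ]
}

lo=1      # known-good count (10-20 sites index fine)
hi=800    # known-bad count (full list crawls off-site)
while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    if fails_at "$mid"; then
        hi=$mid
    else
        lo=$mid
    fi
done
echo "breaking point: $hi sites"
```

With the placeholder predicate above this prints "breaking point: 501 sites"; with a real check it would converge in about ten indexing runs instead of hundreds.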

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html