Re: [htdig] Digging the world / 'limit_urls_to:' ignored?

Gilles Detillieux Tue, 27 Apr 2004 12:57:34 -0700

According to Karl Pielorz:
> We're running htdig 3.1.6 under FreeBSD 4.9. We index a number of local 
> sites, which we do by using:
> 
> "
> start_url:            `/usr/local/htdig/url_list.txt`
> "
> 
> and,
> 
> "
> limit_urls_to:        ${start_url}
> "
...
> But then, we see stuff like:
> 
> "
> A tag: pos = 16, position = ="http://u.extreme-dm.com/?login=zzq9w8sak";>
> 
>    pushing http://u.extreme-dm.com/?login=zzq9w8sak
...
> So, obviously that login isn't valid any more - but why did htdig try to 
> fetch the site? - It's not in our url_list.txt - and it shouldn't appear in 
> the ${start_url} either should it?


The only thing I can think of is maybe you have lines in url_list.txt that
are too long.  That file will be read by ParsedString::getFileContents(),
in htlib/ParsedString.cc, and it uses a 1000-character buffer to read in
lines from the file.  Any lines that are longer will be chopped in two,
and it could be that the remaining fragment is responsible for the false
matches you're seeing.  If that's the case, you can increase the buffer
size to something bigger than the largest URL you need to deal with,
or rewrite the code to deal with any size line.  (If you do the latter,
we'd appreciate the patch.)
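
As a rough illustration of the second option, here is a minimal sketch of
reading a URL list without any fixed-size line buffer.  This is not the
actual ht://Dig code: the function name readFileContents is hypothetical,
it uses std::string rather than ht://Dig's own String class, and joining
the lines with single spaces is an assumption about how the contents
should be combined.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper, not the real ParsedString::getFileContents():
// reads every line of a file, whatever its length, and joins the
// non-empty lines with single spaces (the separator is an assumption).
std::string readFileContents(const std::string &path)
{
    std::ifstream in(path.c_str());
    std::string result;
    std::string line;

    // std::getline grows `line` as needed, so a long URL is never split
    while (std::getline(in, line)) {
        // Drop a trailing carriage return left by DOS line endings
        if (!line.empty() && line[line.size() - 1] == '\r')
            line.erase(line.size() - 1);
        if (line.empty())
            continue;
        if (!result.empty())
            result += ' ';
        result += line;
    }
    return result;
}
```

Because each line is read whole, a URL longer than 1000 characters stays
in one piece, so no leftover fragment can end up as a separate pattern.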

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


