--On 27 April 2004 23:20 +0100 Karl Pielorz <[EMAIL PROTECTED]> wrote:

The only thing I can think of is maybe you have lines in url_list.txt
that are too long.  That file will be read by
ParsedString::getFileContents(), in htlib/ParsedString.cc, and it uses a
1000 character buffer to read in lines from the file.  Any lines that
are longer will be chopped in two, and it could be that the remaining
fragment is responsible for the false matches you're seeing.  If that's
the case, you can increase the buffer size to something bigger than the
largest URL you need to deal with, or rewrite the code to deal with any
size line.  (If you do the latter, we'd appreciate the patch.)
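
For anyone wanting to try the "any size line" route, here is a minimal sketch of the idea. It is not the actual ParsedString::getFileContents() code; readAllLines is a hypothetical helper name, and it assumes a C++11 compiler and standard iostreams rather than the htlib classes. The point is only that std::getline grows the string as needed, so a long URL is never chopped in two.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of ht://Dig: read every line of a stream,
// whatever its length, into a vector of strings.
std::vector<std::string> readAllLines(std::istream &in)
{
    std::vector<std::string> lines;
    std::string line;
    // std::getline resizes 'line' as required, so no line is ever split
    // the way a fixed-size character buffer would split it.
    while (std::getline(in, line)) {
        // Drop a trailing carriage return left by DOS-style line endings.
        if (!line.empty() && line.back() == '\r')
            line.pop_back();
        lines.push_back(line);
    }
    return lines;
}
```

Passing a std::ifstream opened on url_list.txt should give one entry per line of the file, however long the individual URLs are.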

I finally got to the bottom of this - after having a merry trip around htlib/StringMatch.cc & Co.

I traced it down to spaces being present in some of the URLs in the ${start_url} file specified in htdig.conf.

What was happening was that each of these URLs was being split at the spaces and passed to StringMatch::Pattern(...) as separate 'fragments', e.g.

http://www.somewhere.com/some-page-you want-indexing .html

Would be added to the limits list as:

"
http://www.somewhere.com/some-page-you
want-indexing
.html
"

Thus causing _any_ URL that happened to have a .html in it to match (which, let's face it, is going to be a lot).
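
To make the effect concrete, here is a small stand-alone sketch. It does not use the real StringMatch class; splitOnWhitespace and matchesAny are hypothetical names, and the plain substring test is only an assumption standing in for whatever matching the limits list really does. It shows how one URL with embedded spaces becomes three patterns, the last of which matches almost anything.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for building a pattern list from
// whitespace-separated text (not the real ht://Dig code).
std::vector<std::string> splitOnWhitespace(const std::string &text)
{
    std::vector<std::string> parts;
    std::istringstream in(text);
    std::string word;
    // operator>> skips whitespace, so each run of non-space
    // characters becomes its own pattern.
    while (in >> word)
        parts.push_back(word);
    return parts;
}

// True if any pattern occurs anywhere inside the URL
// (a plain substring test, assumed here for illustration).
bool matchesAny(const std::vector<std::string> &patterns,
                const std::string &url)
{
    for (const std::string &p : patterns)
        if (url.find(p) != std::string::npos)
            return true;
    return false;
}
```

Feeding the example URL above through splitOnWhitespace gives three patterns, and the stray ".html" pattern then makes matchesAny return true for a completely unrelated address such as http://unrelated.example.org/index.html.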

I'm surprised ht://Dig doesn't have a separate 'definitive' list of just the URLs it's allowed to "touch" - I guess this would be duplicating stuff already used & handled with 'limits'.

At least if it happens again - I know where to look...

-Karl

