--On 27 April 2004 14:46 -0500 Gilles Detillieux <[EMAIL PROTECTED]> wrote:


" limit_urls_to: ${start_url} "
...
But then, we see stuff like:

"
A tag: pos = 16, position = ="http://u.extreme-dm.com/?login=zzq9w8sak">

pushing http://u.extreme-dm.com/?login=zzq9w8sak
...
So, obviously that login isn't valid any more - but why did htdig try to
fetch the site? It's not in our url_list.txt, and it shouldn't appear
in the ${start_url} either, should it?

The only thing I can think of is maybe you have lines in url_list.txt that are too long. That file will be read by ParsedString::getFileContents(), in htlib/ParsedString.cc, and it uses a 1000 character buffer to read in lines from the file. Any lines that are longer will be chopped in two, and it could be that the remaining fragment is responsible for the false matches you're seeing. If that's the case, you can increase the buffer size to something bigger than the largest URL you need to deal with, or rewrite the code to deal with any size line. (If you do the latter, we'd appreciate the patch.)
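
To illustrate the truncation effect described above, here is a minimal sketch. It is not the actual ht://Dig code: readFixed() and readAnyLength() are hypothetical helpers, and the 5-byte buffer mentioned below stands in for the 1000-character one so the effect is easy to see. readFixed() imitates an fgets()-style read, where a line longer than the buffer comes back in several pieces; readAnyLength() uses std::getline(), which grows the string as needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: imitate reading lines through a fixed-size buffer.
// Like fgets(), each read returns at most bufSize - 1 characters, so a
// long line is returned as several separate fragments.
// Assumes bufSize >= 2.
std::vector<std::string> readFixed(const std::string& text, std::size_t bufSize)
{
    std::vector<std::string> pieces;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t limit = std::min(text.size(), pos + bufSize - 1);
        std::size_t nl = text.find('\n', pos);
        bool hitNewline = (nl != std::string::npos && nl <= limit);
        std::size_t end = hitNewline ? nl : limit;
        pieces.push_back(text.substr(pos, end - pos));
        pos = hitNewline ? end + 1 : end;   // skip the newline if we reached it
    }
    return pieces;
}

// Hypothetical helper: read lines of any length with std::getline().
std::vector<std::string> readAnyLength(const std::string& text)
{
    std::vector<std::string> lines;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}
```

With the input "abcdefghij\nxy\n" and a buffer size of 5, readFixed() yields the fragments "abcd", "efgh", "ij" and "xy", whereas readAnyLength() yields the two original lines. If each fragment were then treated as its own entry, a short leftover such as "ij" could behave like a very permissive pattern.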

Thanks for the reply :)


I've double-checked our url_list.txt file - nothing's over about 60-70 characters. Searching for a 'partial' match (at least in vi) shows nothing similar in the file either [as I've read, the ${start_url} is only a partial-match thing, apparently :)]

The URL it starts to wander off and index is nothing like the ones in our url_list.txt (and hence shouldn't be anything like the one(s) in ${start_url}).
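
For reference, the kind of partial match being discussed can be sketched like this. It assumes limit_urls_to is treated as a list of plain substrings; matchesLimit() is a hypothetical name and the patterns are made up for illustration, so this is not how htdig itself implements the check.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: a URL is accepted if it contains any of the
// limit patterns as a substring. Empty patterns are ignored here,
// since an empty string would match every URL.
bool matchesLimit(const std::string& url,
                  const std::vector<std::string>& patterns)
{
    for (const std::string& p : patterns) {
        if (!p.empty() && url.find(p) != std::string::npos) {
            return true;
        }
    }
    return false;
}
```

Under that assumption, http://u.extreme-dm.com/?login=zzq9w8sak is rejected when the only pattern is a full start URL such as "http://www.example.org/", but accepted as soon as the list also contains a short stray entry such as ".com/". So one way to narrow things down would be to look at the expanded value of limit_urls_to for any very short or unexpected entries.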

Thanks for the pointer on where to start looking though - I'll have to see if my limited C++ skills can carry me a bit further down the "why the hell?" chain :-)

Cheers,

-Karl


_______________________________________________
ht://Dig general mailing list: <[EMAIL PROTECTED]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general
