According to Patrick:
> My point was that if I specify a maximum document limit of 150, I
> would like htdig not to just count to 150 in terms of how many
> links it attempted to crawl, but rather, 150 full documents.
> Basically, 40 of my documents redirect to outside servers which
> are not to be crawled, but htdig STILL increments the counter.
>
> While this seems perfectly acceptable to most, I am hoping someone
> could give me a tip/patch that would NOT increment the total_index
> counter when it encounters a 300-series redirect HTTP message; or
> even increment the server_max_docs by one just to compensate and
> not interfere with any DocumentId down the line.
>
> My goal is to crawl and record exactly the "server_max_docs"
> documents -- real documents, not uncrawlable 301/302 redirects.

Well, that certainly clarifies the problem. I'd classify this as a bug.
The Server class maintains its own private document count which it
increments whenever you do a push(), whether that push() comes from
a got_href() or a got_redirect(). You'd either need to add a Server
method that lets you decrement this counter from the Retriever class,
or add a flag argument to the push() method to tell it whether or not
to increment the counter. I don't have time right now to put together
a complete patch, but I hope this points you in the right direction.
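
To make the two options a bit more concrete, here is a rough sketch. It is not the actual htdig source: the class below is a stripped-down stand-in for Server, and the count_it argument, the uncount() method and the member names are all hypothetical names I made up for illustration. The real push() has a different signature and does more work.

```cpp
#include <cstddef>
#include <queue>
#include <string>

// Simplified stand-in for the Server class (NOT the real htdig code).
class Server {
public:
    // max_docs < 0 means "no limit" in this sketch (an assumption).
    explicit Server(int max_docs) : max_docs_(max_docs), doc_count_(0) {}

    // Option 2: a flag argument (hypothetical). A caller such as
    // got_redirect() would pass false so the push is not counted.
    bool push(const std::string &url, bool count_it = true) {
        if (max_docs_ >= 0 && doc_count_ >= max_docs_)
            return false;               // limit reached, refuse the URL
        urls_.push(url);
        if (count_it)
            ++doc_count_;
        return true;
    }

    // Option 1: a method (hypothetical) the Retriever could call to give
    // back one slot when a counted URL turns out to be a redirect.
    void uncount() {
        if (doc_count_ > 0)
            --doc_count_;
    }

    int doc_count() const { return doc_count_; }
    std::size_t queued() const { return urls_.size(); }

private:
    int max_docs_;                      // limit on counted documents
    int doc_count_;                     // the private counter
    std::queue<std::string> urls_;      // URLs waiting to be retrieved
};
```

In your case the redirecting URL has presumably already been counted by the time the 301/302 comes back, so the decrement method is probably the closer fit; the flag mainly keeps a redirect target from being counted a second time.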
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930