According to Patrick:
> My point was that if I specify a maximum document limit of 150, I
> would like htdig not to just count to 150 in terms of how many
> links it attempted to crawl, but rather, 150 full documents.
> Basically, 40 of my documents redirect to outside servers which
> are not to be crawled, but htdig STILL increments the counter.
> 
> Where this seems perfectly acceptable to most, I am hoping someone
> could give me a tip/patch that would NOT increment the total_index
> counter when it encounters a 300-series redirect HTTP message; or
> even increment the server_max_docs by one just to compensate and
> not interfere with any DocumentId down the line.
> 
> My goal is to crawl and record exactly the "server_max_docs"
> documents -- real documents, not uncrawlable 301/302 redirects.

Well, that certainly clarifies the problem.  I'd classify this as a bug.
The Server class maintains its own private document count which it
increments whenever you do a push(), whether that push() comes from
a got_href() or a got_redirect().  You'd either need to add a Server
method that lets you decrement this counter from the Retriever class,
or add a flag argument to the push() method to tell it whether or not
to increment the counter.  I don't have time right now to put together
a complete patch, but I hope this points you in the right direction.
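
To make the two options concrete, here is a rough sketch of what they could
look like. This is not the actual ht://Dig source: ServerSketch, its members,
the count_it flag and the uncount() method are all hypothetical names made up
for illustration, and the real Server class is laid out differently. The idea
is only that a push() triggered by a redirect can skip the increment, or that
the Retriever can give a slot back after the fact.

```cpp
#include <cstddef>
#include <queue>
#include <string>

// Hypothetical stand-in for the Server class; all names are illustrative
// and do not match the real ht://Dig code.
class ServerSketch {
public:
    explicit ServerSketch(std::size_t max_docs) : max_docs_(max_docs) {}

    // Option 1: a flag argument on push(). A caller handling a redirect
    // would pass count_it = false so the URL is queued without using up
    // one of the server's document slots.
    // Returns false if the limit is already reached and the URL is dropped.
    bool push(const std::string &url, bool count_it = true) {
        if (docs_ >= max_docs_) return false;
        if (count_it) ++docs_;
        paths_.push(url);
        return true;
    }

    // Option 2: a method the Retriever could call when a fetched URL
    // turns out to be a redirect, giving its slot back.
    void uncount() {
        if (docs_ > 0) --docs_;
    }

    std::size_t count() const { return docs_; }
    std::size_t queued() const { return paths_.size(); }

private:
    std::size_t max_docs_;           // assumed to mirror server_max_docs
    std::size_t docs_ = 0;           // the private document count
    std::queue<std::string> paths_;  // URLs waiting to be retrieved
};
```

With a limit of 2, pushing "/a" normally and "/moved" with count_it set to
false leaves the count at 1, so one more real document still fits. Either
approach leaves DocumentIds alone, since only the per-server counter changes.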

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
