Hi Kelvin:
1) Bot-trap problem for OC
If we keep a crawl depth for each starting host, it
seems the crawl is guaranteed to terminate eventually
(we can decrement the depth value each time an outlink
falls in the same host domain).
Let me know if my thinking is wrong.
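
To make the idea concrete, here is a rough sketch of what I mean. It is
not OC code: the class, the method, and the outlinksOf function are all
hypothetical names of my own, and I am assuming that links leaving the
starting host are simply dropped. Each same-host hop spends one unit of
the depth budget, so even a trap that generates endless new URLs stops
after a bounded number of fetches.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch, not part of OC or Nutch.
public class DepthLimitedCrawl {
    /** A queued URL plus the same-host depth budget it still has left. */
    static final class Entry {
        final String url;
        final int remaining;
        Entry(String url, int remaining) { this.url = url; this.remaining = remaining; }
    }

    /**
     * Crawls from one seed and returns the number of pages fetched.
     * outlinksOf stands in for "fetch the page and parse its links".
     */
    static int crawl(String seed, int startDepth, Function<String, List<String>> outlinksOf) {
        String seedHost = URI.create(seed).getHost();
        Deque<Entry> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        queue.add(new Entry(seed, startDepth));
        seen.add(seed);
        int fetched = 0;
        while (!queue.isEmpty()) {
            Entry e = queue.poll();
            fetched++;
            if (e.remaining == 0) continue; // budget spent: fetch the page, follow nothing
            for (String link : outlinksOf.apply(e.url)) {
                String host = URI.create(link).getHost();
                // Assumption: off-host links are dropped, so the crawl stays on the seed host.
                if (host == null || !host.equalsIgnoreCase(seedHost)) continue;
                // Same host: the child inherits the parent's budget minus one.
                if (seen.add(link)) queue.add(new Entry(link, e.remaining - 1));
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        // A synthetic bot trap: every page links to one brand-new page on the same host.
        Function<String, List<String>> trap = url -> List.of(url + "/x");
        System.out.println(crawl("http://trap.example/a", 3, trap));
    }
}
```

With a starting depth of 3, the trap above is cut off after the seed plus
three more pages. The bound only holds if every page has a finite number
of outlinks, which I think is a safe assumption.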
2) Refetching
If OC's fetchlist is kept online (memory-resident),
then the next time we refetch we have to start from
seeds.txt all over again. Is that right?
3) Page content checking
In the OC API I found WebDBContentSeenFilter, which
uses the Nutch webdb data structure to check whether
the fetched page content has been seen before. That
means we have to use Nutch to create a webdb (maybe
via nutch/updatedb) in order to support this function.
Is that right?
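
For comparison, the check I would expect if no webdb were involved is
something like the sketch below. The class name is made up by me, and I
am not claiming this is how WebDBContentSeenFilter works internally; it
only keeps a digest of each page body in memory and reports a repeat
when the same digest shows up again.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory filter, not part of OC or Nutch.
public class InMemoryContentSeen {
    private final Set<String> digests = new HashSet<>();

    /** Returns true if identical content was recorded earlier; records it otherwise. */
    public boolean seenBefore(byte[] content) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(content);
            // Set.add returns false when the digest is already present.
            return !digests.add(Base64.getEncoder().encodeToString(d));
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        InMemoryContentSeen filter = new InMemoryContentSeen();
        byte[] page = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(filter.seenBefore(page)); // first time this content is offered
        System.out.println(filter.seenBefore(page)); // same bytes offered again
    }
}
```

Such a set is lost when the process exits, which is why I am asking
whether the webdb is what is supposed to make this persistent.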
Thanks,
Michael