Dear Nutchers!

I would like to ask some newbie questions (after reading the docs for about a day):

* How hard would it be to add support for adaptive refetching of pages, depending on how often they change?
* Is the only way to limit the maximum recursion depth (as in wget --recursive --level) to iteratively generate/fetch/update segments containing the 0th, 1st, 2nd, etc. link generations, as in http://wiki.apache.org/nutch/IntranetRecrawl ?
* How does Nutch deal with spider traps (http://en.wikipedia.org/wiki/Spider_trap)? Or do breadth-first crawling and OPIC scoring just postpone "infinite" sequences of links for long enough?
* How would one implement a long-running server that executes the generate/fetch/update loop? It would have to survive machine restarts, etc. Maybe somebody has already done this?
* If I want to do something custom with fetched segments, would it be a good idea to translate the individual downloaded pages into HBase entries (using map/reduce)? Maybe somebody has already done this? Would you use HBase to implement CrawlDB/LinkDB if you were writing them now?
Thank you very much for your answers!

Oleg Mürk
