Some questions about Nutch

Oleg Mürk Mon, 11 Feb 2008 10:32:21 -0800

Dear Nutchers!

I would like to ask some newbie questions (after reading the docs for about
a day):
* How hard would it be to add support for adaptive refetching of pages,
depending on how often they change?
* Is the only way to limit the maximum recursion depth (as in wget
--recursive --level) to iteratively generate/fetch/update segments for the
0th, 1st, 2nd, etc. link generations? As in:
http://wiki.apache.org/nutch/IntranetRecrawl
* How does Nutch deal with spider traps:
http://en.wikipedia.org/wiki/Spider_trap
   Or do breadth-first/OPIC just postpone "infinite" sequences of links
long enough?
* How would one implement a long-running server that executes the
generate/fetch/update loop? It would have to survive machine restarts etc.
Maybe somebody has done it already?
* In case I want to do something custom with fetched segments, would it be a
good idea to translate individual downloaded pages into HBase entries (using
map/reduce)?
   Maybe somebody has done it already? Would you use HBase to implement
CrawlDB/LinkDB if you wrote them now?


Thank you very much for your answers!
Oleg Mürk
