Oleg, just a quick pointer on adaptive refetching: isn't this already available? See https://issues.apache.org/jira/browse/NUTCH-61
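
For anyone unfamiliar with the idea, here is a minimal sketch of adaptive refetching: shorten a page's refetch interval when a fetch shows it changed, lengthen it when it did not, and clamp the result. It is not taken from the NUTCH-61 patch. The class name, method name, rates and bounds are all assumptions chosen for illustration.

```java
// Illustrative sketch only: AdaptiveInterval, next(), and every constant
// below are hypothetical and are not part of Nutch's API.
final class AdaptiveInterval {
    static final double INC_RATE = 0.4;               // assumed: grow interval 40% if unchanged
    static final double DEC_RATE = 0.2;               // assumed: shrink interval 20% if changed
    static final long MIN_SECONDS = 60L;              // assumed lower bound
    static final long MAX_SECONDS = 365L * 24 * 3600; // assumed upper bound (one year)

    // Returns the next refetch interval in seconds, clamped to [MIN_SECONDS, MAX_SECONDS].
    static long next(long currentSeconds, boolean pageChanged) {
        double factor = pageChanged ? (1.0 - DEC_RATE) : (1.0 + INC_RATE);
        long proposed = Math.round(currentSeconds * factor);
        return Math.max(MIN_SECONDS, Math.min(MAX_SECONDS, proposed));
    }

    public static void main(String[] args) {
        long interval = 30L * 24 * 3600;         // start at 30 days
        boolean[] observed = {true, true, false}; // changed, changed, unchanged
        for (boolean changed : observed) {
            interval = next(interval, changed);
            System.out.println("changed=" + changed + " -> next interval " + interval + " s");
        }
    }
}
```

A multiplicative update like this converges slowly but needs no per-page history beyond the current interval, which is why I used it for the sketch. The real implementation may well differ.
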

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Oleg Mürk <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, February 11, 2008 1:31:47 PM
> Subject: Some questions about Nutch
>
> Dear Nutchers!
>
> I would like to ask some newbie questions (after reading the docs for
> about a day):
>
> * How hard would it be to add support for adaptive refetching of pages
>   depending on how often they change?
> * Is the only way to limit maximum recursion depth (as in wget --recursive
>   --level) to iteratively generate/fetch/update segments with the 0th,
>   1st, 2nd, etc. link generations respectively? As in:
>   http://wiki.apache.org/nutch/IntranetRecrawl
> * How does Nutch deal with spider traps?
>   http://en.wikipedia.org/wiki/Spider_trap
>   Or do breadth-first/OPIC simply postpone "infinite" sequences of links
>   for long enough?
> * How would one implement a long-running server that executes the
>   generate/fetch/update loop? It would have to survive machine restarts,
>   etc. Maybe somebody has done it already?
> * If I want to do something custom with fetched segments, would it be a
>   good idea to translate individual downloaded pages into HBase entries
>   (using map/reduce)? Maybe somebody has done it already? Would you use
>   HBase to implement CrawlDB/LinkDB if you wrote them now?
>
> Thank you very much for your answers!
> Oleg Mürk
