Re: Some questions about Nutch

Otis Gospodnetic Sun, 17 Feb 2008 14:20:45 -0800

Oleg - just a quick pointer to adaptive refetching - is this not already 
available?
See https://issues.apache.org/jira/browse/NUTCH-61
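
For illustration only, here is a rough sketch of the general idea behind adaptive refetching: shrink a page's refetch interval when it has changed since the last fetch, grow it when it has not, and clamp the result. This is not the code from NUTCH-61; the function name, the rates, and the bounds are all assumptions made up for this example.

```python
# Hypothetical tuning values, chosen for this example only
# (they are not Nutch defaults).
MIN_INTERVAL = 60 * 60             # 1 hour, in seconds
MAX_INTERVAL = 30 * 24 * 60 * 60   # 30 days, in seconds
INC_RATE = 0.5                     # grow the interval by 50% if the page is unchanged
DEC_RATE = 0.5                     # shrink the interval by 50% if the page changed


def next_interval(current: float, changed: bool) -> float:
    """Return the next refetch interval in seconds.

    `current` is the interval used for the last fetch; `changed` says
    whether the page content differed from the previously stored copy.
    """
    if changed:
        proposed = current * (1.0 - DEC_RATE)
    else:
        proposed = current * (1.0 + INC_RATE)
    # Clamp so that pages are neither hammered nor forgotten.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


if __name__ == "__main__":
    interval = 24 * 60 * 60  # start with one day
    # Simulated fetch history: True means the page changed.
    for changed in [False, False, True, True, True]:
        interval = next_interval(interval, changed)
        print(f"changed={changed} -> next fetch in {interval / 3600:.1f} h")
```

Whether a page "changed" would typically be decided by comparing a content signature (for example a hash of the fetched body) with the one stored at the previous fetch.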


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Oleg Mürk <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, February 11, 2008 1:31:47 PM
> Subject: Some questions about Nutch
> 
> Dear Nutchers!
> 
> I would like to ask some newbie questions (after reading the docs for about a
> day):
> * How hard would it be to add support for adaptive refetching of pages
> depending on how often they change?
> * Is the only way to limit maximum recursion (as in wget --recursive
> --level) to iteratively generate/fetch/update segments with respectively
> 0-th, 1-st, 2-nd etc link generations? As in:
> http://wiki.apache.org/nutch/IntranetRecrawl
> * How does Nutch deal with spider traps:
> http://en.wikipedia.org/wiki/Spider_trap
>    Or do breadth-first/OPIC simply postpone sufficiently long "infinite"
> sequences of links?
> * How would one implement a long-running server that executes the
> generate/fetch/update loop? It would have to survive machine restarts etc.
> Maybe somebody has done it already?
> * In case I want to do something custom with fetched segments, would it be a
> good idea to translate individual downloaded pages into HBase entries (using
> map/reduce)?
>    Maybe somebody has done it already? Would you use HBase to implement
> CrawlDB/LinkDB if you wrote them now?
> 
> Thank you very much for your answers!
> Oleg Mürk
>
