Eugen Kochuev wrote:
Hi guys,

I'm new to Nutch, so I've run into a lot of unanswered questions.

1) How do I implement continuous crawling? I want to keep the crawler
running all the time, but it's not clear to me how to manage the
indexes, segments, linkdb and crawldb.
 - As I understood from messages on this mailing list, I can safely
 remove any segment older than the refetch interval. Right? But what
 if some documents weren't refetched within that time, for example
 because the server was down? Those documents will be lost. How can I
 prevent this?

True, they will be lost for a short while, but they will be retried in the next fetchlist, because they are marked as "expired". Only if the number of retries exceeds a maximum are they declared gone. Currently pages in the GONE state are never retried; this will be fixed so that they are retried after the "max. refetch interval" period.
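
In code terms the policy looks roughly like this. This is just an illustrative sketch, not the actual crawldb update code; the retry limit corresponds to the db.fetch.retry.max property (the exact name and default may differ between versions):

// Illustrative sketch of the retry policy described above: a page whose
// fetch fails stays in the crawldb and keeps showing up in fetchlists
// until its retry budget is exhausted, at which point it is marked GONE.
public class RetryPolicySketch {

    // roughly corresponds to the "db.fetch.retry.max" property
    static final int MAX_RETRIES = 3;

    enum Status { RETRY, GONE }

    static Status onFetchFailure(int retriesSoFar) {
        return (retriesSoFar + 1 >= MAX_RETRIES) ? Status.GONE : Status.RETRY;
    }

    public static void main(String[] args) {
        for (int r = 0; r < 4; r++) {
            System.out.println("failure #" + (r + 1) + " -> " + onFetchFailure(r));
        }
    }
}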

 - Should I reindex all the segments every time, or just the new ones,
 merging them with the previously indexed ones? Another question
 arises here: if a URL is refetched and reindexed, what will happen
 when I try to merge two indexes that both contain the same URL?

You will end up with two documents in the index. That's one of the reasons why de-duplication is necessary.
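
Conceptually, the dedup job keeps a single document per URL (and per content signature) and deletes the rest from the index. A sketch of the URL-based half of that idea (not the actual dedup code):

import java.util.HashMap;
import java.util.Map;

// Sketch of URL-based de-duplication: when two index entries share the
// same URL, keep only the most recently fetched one. The real job also
// de-duplicates by content signature; this only shows the idea.
public class UrlDedupSketch {

    static class Doc {
        final String url;
        final long fetchTime;
        Doc(String url, long fetchTime) { this.url = url; this.fetchTime = fetchTime; }
    }

    static Map<String, Doc> dedup(Doc[] docs) {
        Map<String, Doc> newest = new HashMap<String, Doc>();
        for (Doc d : docs) {
            Doc seen = newest.get(d.url);
            if (seen == null || d.fetchTime > seen.fetchTime) {
                newest.put(d.url, d);  // the older copy would be deleted from the index
            }
        }
        return newest;
    }

    public static void main(String[] args) {
        Doc[] docs = {
            new Doc("http://example.com/a", 100L),
            new Doc("http://example.com/a", 200L),  // refetched later, so it wins
            new Doc("http://example.com/b", 150L),
        };
        System.out.println(dedup(docs).size() + " unique URLs kept");
    }
}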

 - Can the linkdb be updated incrementally, or is it recreated from
 scratch every time?

As of Apr 28 it can be incrementally updated.

2) The dedup function is fine, but it is useless for finding
near-duplicates. Is near-duplicate filtering on the priority list?
A shingles algorithm would be great.

Please see the Signature API, and specifically the TextProfileSignature.
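
If you want to experiment with shingling yourself, the core of it is small. Here is a toy sketch (note this is not what TextProfileSignature does; that one hashes a quantized profile of the most frequent terms, which is cheaper because it needs no pairwise comparison):

import java.util.HashSet;
import java.util.Set;

// Toy w-shingling sketch: split text into word n-grams ("shingles") and
// compare two documents by the Jaccard overlap of their shingle sets
// (intersection size divided by union size).
public class ShingleSketch {

    static Set<String> shingles(String text, int w) {
        String[] words = text.toLowerCase().split("\\W+");
        Set<String> result = new HashSet<String>();
        for (int i = 0; i + w <= words.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < w; j++) {
                if (j > 0) sb.append(' ');
                sb.append(words[i + j]);
            }
            result.add(sb.toString());
        }
        return result;
    }

    static double similarity(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<String>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<String>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> a = shingles("the quick brown fox jumps over the lazy dog", 3);
        Set<String> b = shingles("the quick brown fox leaps over the lazy dog", 3);
        System.out.println("similarity = " + similarity(a, b));
    }
}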

3) When running Nutch distributed, is it possible to assign job roles
to the computers in the cluster? For example, assigning _only_
crawling tasks to several computers.

Not yet. You would need to build two clusters for that (or run some of the tasktrackers from each cluster on the same machines, just using different ports).

4) What about removing undesirable sites from the index? Spammy
sites, for example.

This is the job of URLFilters, IndexingFilters and ScoringFilters - at any of these points you can decide to remove certain sites.
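
For the URL-level case the contract is simple: a filter returns the URL to keep it and null to drop it. A minimal sketch of that idea follows (the real URLFilter extension point also needs a plugin descriptor and configuration plumbing, omitted here; in practice the stock regex/prefix URL filter plugins with an exclusion pattern are usually enough):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the URL-filtering idea: return the URL to keep it, or null
// to drop it. The blocked-host list is just an example.
public class BlocklistUrlFilterSketch {

    private final Set<String> blockedHosts =
        new HashSet<String>(Arrays.asList("spammy.example.com", "junk.example.org"));

    public String filter(String urlString) {
        try {
            String host = new java.net.URL(urlString).getHost().toLowerCase();
            return blockedHosts.contains(host) ? null : urlString;
        } catch (java.net.MalformedURLException e) {
            return null;  // unparsable URLs are dropped as well
        }
    }

    public static void main(String[] args) {
        BlocklistUrlFilterSketch f = new BlocklistUrlFilterSketch();
        System.out.println(f.filter("http://spammy.example.com/page")); // null
        System.out.println(f.filter("http://good.example.net/page"));   // kept
    }
}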


5) Is an adaptive refetch interval planned? Having one value for all
sites isn't acceptable in the real world, since some sites require
frequent updates while others are completely static and should rarely
be recrawled.


It's in the final preparation steps before committing.
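
The general idea, as a sketch of the usual adaptive scheme (with made-up constants, not necessarily what will be committed): stretch the interval when a page comes back unchanged, shrink it when it has changed, and clamp the result to configured bounds.

// Sketch of an adaptive refetch schedule: the interval grows for pages
// that don't change and shrinks for pages that do, within configured
// minimum and maximum bounds. The constants here are illustrative only.
public class AdaptiveIntervalSketch {

    static final float INC_RATE = 0.4f;                    // grow by 40% when unchanged
    static final float DEC_RATE = 0.2f;                    // shrink by 20% when changed
    static final long MIN_INTERVAL = 60L * 60;             // 1 hour, in seconds
    static final long MAX_INTERVAL = 90L * 24 * 60 * 60;   // 90 days

    static long nextInterval(long current, boolean pageChanged) {
        long next = pageChanged
            ? (long) (current * (1.0f - DEC_RATE))
            : (long) (current * (1.0f + INC_RATE));
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, next));
    }

    public static void main(String[] args) {
        long interval = 7L * 24 * 60 * 60;                 // start at one week
        interval = nextInterval(interval, false);          // unchanged -> longer
        interval = nextInterval(interval, true);           // changed   -> shorter
        System.out.println("interval is now " + interval + " seconds");
    }
}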

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



