Hi guys,

I'm new to Nutch, so I've run into a lot of questions I couldn't find answers to.

1) How do I implement permanent (continuous) crawling? I want the
crawler running all the time, but it's not clear to me how to manage
the indexes, segments, linkdb and crawldb.
 - As I understood from earlier letters on this list, I can safely
 remove any segment older than the refetch interval. Is that right?
 But what if some documents weren't refetched within that time, for
 example because their server was down? Those documents would be
 lost. How can I prevent this?
 - Should I reindex all the segments every time, or only the new
 ones, merging them into the previously built index? That raises
 another question: if a URL is refetched and reindexed, what happens
 when merging two indexes that both contain the same URL?
 - Can the linkdb be updated incrementally, or is it recreated from
 scratch every time?
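For context on question 1, here is the per-cycle sequence I currently run in a loop. The paths are from my own setup, and the exact tool names and arguments are my best guess from the wiki, so they may well differ between versions:

```shell
#!/bin/sh
# One incremental crawl cycle; I repeat this from cron.
# Paths and tool names are from my local setup -- they may not
# match your Nutch version.
NUTCH=bin/nutch
CRAWL=crawl

# 1. Pick URLs due for (re)fetching into a new segment
$NUTCH generate $CRAWL/crawldb $CRAWL/segments -topN 1000
SEGMENT=`ls -d $CRAWL/segments/* | tail -1`

# 2. Fetch the segment
$NUTCH fetch $SEGMENT

# 3. Fold the fetch results back into the crawldb
$NUTCH updatedb $CRAWL/crawldb $SEGMENT

# 4. Rebuild the linkdb (or can this step be incremental?)
$NUTCH invertlinks $CRAWL/linkdb $SEGMENT

# 5. Index just the new segment (does this then need merging?)
$NUTCH index $CRAWL/indexes $CRAWL/crawldb $CRAWL/linkdb $SEGMENT
```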
2) The dedup step works fine, but it is useless for finding near
duplicates. Is near-duplicate filtering on the priority list? A
shingling algorithm would be great.
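To make clear what I mean by shingles: something along these lines, where near-duplicate pages score close to 1.0 and unrelated pages close to 0.0. This is just a toy illustration of the idea, not Nutch code:

```python
def shingles(text, k=3):
    """Set of k-word shingles (contiguous word windows) of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two pages differing by one word share most of their shingles.
doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"
sim = jaccard(shingles(doc1), shingles(doc2))
```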
3) When running Nutch distributed, is it possible to assign job roles
to the machines in the cluster? For example, to assign _only_
crawling tasks to several machines.
4) What about removing undesirable sites from the index, for example
spammy sites?
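On question 4, the closest thing I've found is the regex URL filter (conf/regex-urlfilter.txt in my copy; the file name may vary between versions), which keeps a site out of future crawls but doesn't seem to purge what's already indexed. The domain below is a made-up example:

```
# reject everything from a spammy host
-^http://([a-z0-9-]+\.)*spammy-example\.com/
# accept everything else
+.
```

Is there a supported way to delete the already-indexed pages too?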
5) Is an adaptive refetch interval planned? A single value for all
sites isn't acceptable in the real world, since some sites need
frequent updates while others are completely static and should be
recrawled only rarely.
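To illustrate the kind of policy I have in mind for question 5: a simple multiplicative per-page adjustment, shrinking the interval when a page changed and growing it when it didn't. This is only a sketch of the idea, and the constants are made up:

```python
# Per-page adaptive refetch interval: shrink when the page changed,
# back off when it didn't, clamped to [MIN_DAYS, MAX_DAYS].
MIN_DAYS, MAX_DAYS = 1.0, 90.0

def next_interval(current_days, page_changed):
    if page_changed:
        new = current_days * 0.5   # changed: come back sooner
    else:
        new = current_days * 1.5   # unchanged: back off
    return min(max(new, MIN_DAYS), MAX_DAYS)

# A static page drifts toward the maximum interval:
static_interval = 7.0
for _ in range(10):
    static_interval = next_interval(static_interval, page_changed=False)

# A frequently changing page drifts toward the minimum:
busy_interval = 7.0
for _ in range(10):
    busy_interval = next_interval(busy_interval, page_changed=True)
```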



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
