Hi all, I'm new to Nutch and have run into a number of questions I could not find answers to.
1) How do I implement continuous ("permanent") crawling? I want the crawler to run all the time, but it is not clear to me how to manage the indexes, segments, linkdb and crawldb (see the cycle sketched below for what I have in mind).

   - As I understood from earlier messages on this list, I can safely remove any segment older than the refetch interval. Is that right? But what if some documents were not refetched within that time, for example because their server was down? Those documents would be lost. How can I prevent this?

   - Should I reindex all segments every time, or index only the new ones and merge them with the previously built index? That raises another question: if a URL is refetched and reindexed, what happens when two indexes that both contain that URL are merged?

   - Can the linkdb be updated incrementally, or is it recreated from scratch every time?

2) The dedup tool is fine, but it is useless for finding near duplicates. Is near-duplicate filtering on the priority list? A shingling algorithm would be great (see the sketch below for the kind of thing I mean).

3) When running Nutch distributed, is it possible to assign job roles to the machines in the cluster, for example to dedicate several machines _only_ to crawling tasks?

4) What about removing undesirable sites, for example spammy ones, from the index? (See my note below: URL filters appear to stop only future fetches.)

5) Are there plans for an adaptive refetch interval? One value for all sites is not workable in the real world, since some sites need frequent updates while others are completely static and should be recrawled rarely (the kind of policy I mean is sketched below).
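For context on question 1, here is roughly one iteration of the loop I have in mind, based on the 0.8-style whole-web commands (the crawl/ directory layout and the -topN value are just my own choices):

  # one iteration; I want to repeat this forever
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=`ls -d crawl/segments/2* | tail -1`    # the segment just generated
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $s
  bin/nutch dedup crawl/indexes
  bin/nutch merge crawl/index crawl/indexes

My questions above are about what happens to the old segments, crawl/indexes and crawl/linkdb as this loop runs indefinitely.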
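To make question 2 concrete, something like the following toy w-shingling sketch is what I mean; this is plain Java for illustration, not Nutch code, and all class names and constants are mine:

  import java.util.HashSet;
  import java.util.Set;

  /** Toy w-shingling: near-duplicate score = Jaccard resemblance of word shingles. */
  public class Shingler {

      // Collect every run of w consecutive words, hashed to an int to save space.
      static Set<Integer> shingles(String text, int w) {
          String[] words = text.toLowerCase().split("\\W+");
          Set<Integer> set = new HashSet<Integer>();
          for (int i = 0; i + w <= words.length; i++) {
              int h = 0;
              for (int j = i; j < i + w; j++) h = 31 * h + words[j].hashCode();
              set.add(h);
          }
          return set;
      }

      // Jaccard resemblance |A n B| / |A u B|; values near 1.0 mean near-duplicates.
      static double resemblance(Set<Integer> a, Set<Integer> b) {
          if (a.isEmpty() && b.isEmpty()) return 1.0;
          Set<Integer> inter = new HashSet<Integer>(a);
          inter.retainAll(b);
          return (double) inter.size() / (a.size() + b.size() - inter.size());
      }

      public static void main(String[] args) {
          Set<Integer> a = shingles("the quick brown fox jumps over the lazy dog", 3);
          Set<Integer> b = shingles("the quick brown fox jumped over the lazy dog", 3);
          System.out.println("resemblance = " + resemblance(a, b)); // close to 1.0
      }
  }

Two pages whose resemblance exceeds some threshold (say 0.9) would be treated as duplicates, analogous to what dedup does today for identical content.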
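On question 4: as far as I can tell, a '-' rule in conf/regex-urlfilter.txt only keeps a site from being fetched again, e.g. (the host name here is made up):

  # never fetch anything from this host
  -^http://([a-z0-9-]+\.)*spam-example\.com/

but pages already fetched stay in the index and keep showing up in search results, so what I am really asking is how to purge them from an existing index.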
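And for question 5, this is the kind of policy I mean: shorten the interval each time a fetch finds the page changed, lengthen it when the page comes back unchanged, clamped to sane bounds. A toy Java sketch (the constants and names are mine, this is not existing Nutch code):

  /** Toy adaptive refetch policy: multiplicative decrease on change,
      multiplicative increase on no change, clamped to [MIN_DAYS, MAX_DAYS]. */
  public class AdaptiveInterval {
      static final float MIN_DAYS = 1f;    // never refetch more often than daily
      static final float MAX_DAYS = 90f;   // never wait longer than a quarter
      static final float DEC = 0.5f;       // page changed: refetch twice as often
      static final float INC = 1.5f;       // page unchanged: wait 50% longer

      static float nextInterval(float days, boolean pageChanged) {
          days = pageChanged ? days * DEC : days * INC;
          return Math.max(MIN_DAYS, Math.min(MAX_DAYS, days));
      }

      public static void main(String[] args) {
          float d = 30f;
          // a static page drifts toward MAX_DAYS...
          for (int i = 0; i < 5; i++) System.out.println(d = nextInterval(d, false));
          // ...while a frequently changing one drifts toward MIN_DAYS
          for (int i = 0; i < 5; i++) System.out.println(d = nextInterval(d, true));
      }
  }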
