Quick question (I hope): when I upgrade to the CVS version of Nutch, which files do I need to grab? I grabbed the latest version of Nutch from CVS, but do I need to upgrade anything in the source, or what is the procedure for CVS upgrades?
Thanks a ton.

Jason

----- Original Message -----
From: "Matthias Jaekle" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, September 28, 2004 8:03 AM
Subject: Re: [Nutch-general] Failing Crawls

> Hi,
>
> crawling segments with 1/2 million pages works fine for me with a newer
> nutch version from cvs. Maybe you should also get a newer one.
>
> > Also when I am trying to do 4 and 5 million page fetches I always come back
> > the next day and the fetcher is either hung up, or it will sit idle for a
> > while and then fetch a little bit and stall. Is this a memory thing or how
> > do I get the fetcher to get through a complete segment?
> If you have many urls from one domain your fetcher might fetch one url,
> wait for some seconds, and fetch the next page from the same domain.
> This is implemented to have a polite fetcher. At the end of your
> fetching cycle this means, that your fetcher could rest much more than
> crawling.
> Maybe this information helps.
>
> > I am still using .05 for the code at the moment. How do I eliminate .pdf's
> > so there is no chance of a PDF hanging the system up?
> Use the RegexUrlFilter and configure to avoid pdfs in the regex-file in
> your conf dir.
>
> If you use the prefixUrl filter, you should switch to a combination of
> both filters. I don't know, if it is in cvs, but the one we use, you
> will find at http://nutch.eventax.com/ (PrefixB4URLFilter).
>
> Bye
>
> Matthias
>
> --
> http://gmbh.eventax.de - eventax GmbH
> http://www.umkreisfinder.de - Die Suchmaschine für Lokales und Events
> http://www.fahnen-drucken.de - Flaggen einfach selbst gemacht
>
> -------------------------------------------------------
> This SF.Net email is sponsored by: YOU BE THE JUDGE. Be one of 170
> Project Admins to receive an Apple iPod Mini FREE for your judgement on
> who ports your project to Linux PPC the best. Sponsored by IBM.
> Deadline: Sept. 24. Go here: http://sf.net/ppc_contest.php
> _______________________________________________
> Nutch-general mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/nutch-general
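
For anyone reading this thread later: the advice above about using the RegexUrlFilter to skip PDFs can be illustrated with a small Python sketch. It is not Nutch code. It assumes the filter reads rules in order, that a leading `-` rejects and a leading `+` accepts, that the first matching rule decides, and that a URL matching no rule is rejected. The rule text, `parse_rules`, and `accepts` are hypothetical names made up for this illustration, so check them against the regex file shipped in your own conf dir before relying on the pattern.

```python
import re

# Hypothetical rules in the style of a Nutch regex URL filter file.
# Assumption: '-' rejects, '+' accepts, first matching rule wins.
RULES = r"""
# skip PDF documents
-\.(pdf|PDF)$
# accept everything else
+.
"""


def parse_rules(text):
    """Turn rule text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumption: no matching rule means the URL is skipped


rules = parse_rules(RULES)
print(accepts(rules, "http://example.com/index.html"))
print(accepts(rules, "http://example.com/paper.pdf"))
```

Because the first match decides, the exclusion line has to come before the catch-all `+.` line. Note that this pattern only catches URLs that end in `.pdf`, so a PDF served behind a query string would still get through.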
