Uhuh, yes, this is most likely due to session IDs creating unique URLs that Nutch keeps discovering and fetching as if they were new pages. Look at conf/regex-normalize.xml for how you can normalize the session ID out of URLs before they enter the crawl db. That should help.
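
For the URLs you quote below, a rule along these lines might do it. The pattern is only my guess at the session-token format (a hex blob after /enid/), so adjust it to what the site actually emits, and verify in a browser that pages are still served once the token is stripped:

  <!-- in conf/regex-normalize.xml; Java regex syntax, rules run in order -->
  <regex-normalize>
    <!-- hypothetical rule: drop the hex session token after /enid/ so
         every page maps back to a single canonical URL -->
    <regex>
      <pattern>/enid/[0-9a-f]+,[0-9a-f]+/</pattern>
      <substitution>/enid/</substitution>
    </regex>
  </regex-normalize>

Make sure the urlnormalizer-regex plugin is listed in plugin.includes, otherwise the rules won't be applied. I would also put back the default crawl-urlfilter.txt line you removed, which skips URLs where a path segment repeats 3+ times:

  -.*(/[^/]+)/[^/]+\1/[^/]+\1/

It's in the default config exactly to break loops like this one.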
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Felix Zimmermann <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, June 16, 2008 8:46:29 AM
> Subject: infinite loop-problem
>
> Hi,
>
> While crawling the webpage http://www.bmj.de, I suspect I am caught in
> an infinite loop: Nutch has been fetching for two days now and there
> seems to be no end.
>
> I need every linked document from this website.
>
> My configuration:
>
> A. crawl-urlfilter.txt:
>
> 1. I removed the default line that is meant to break loops when a path
>    segment repeats 3+ times. I think this is OK in my case and is not
>    the cause of my problem.
> 2. The URL filter is +^http://www.bmj.de/
> 3. Command-line options: "nutch crawl .. -depth 10 -topN 10000"
>
> B. I set up the Nutch config to fetch first and parse afterwards, in
>    order to increase fetching speed.
>
> Is it because of the session IDs and navigation strings in the URLs?
> They look like this:
>
> http://www.bmj.de/enid/3323c15e419390ec405dcc561513c2d3,1489d6706d635f6964092d0935313835093a0979656172092d0932303038093a096d6f6e7468092d093035093a095f7472636964092d0935313835/Pressestelle/Pressemitteilungen_58.html
>
> How can I deal with this?
>
> I'm running Nutch/Solr as proposed by Doğacan Güney et al. in
> NUTCH-442 (see https://issues.apache.org/jira/browse/NUTCH-442) with
> Tomcat 6 and Ubuntu 8.04.
>
> Thanks,
> Felix.
