Uh-huh, yes, this is most likely due to session IDs creating unique URLs that 
Nutch keeps processing.
Look at conf/regex-normalize.xml for how you can clean up URLs.  That should 
help.
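
For the /enid/<hex>,<hex>/ URLs you posted, a rule along these lines in 
conf/regex-normalize.xml might do it. This is an untested sketch and the 
pattern is only an illustration; if the second hex blob encodes navigation 
state rather than just the session, stripping it could merge distinct pages, 
so test it against a few real URLs first:

  <!-- Hypothetical rule: collapse the hex session/navigation segment
       after /enid/ so all variants of a page normalize to one URL. -->
  <regex>
    <pattern>/enid/[0-9a-f]+(,[0-9a-f]+)?/</pattern>
    <substitution>/enid/</substitution>
  </regex>

Also check that the urlnormalizer-regex plugin is listed in plugin.includes 
in your nutch-site.xml, otherwise the rules in that file never run.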

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: Felix Zimmermann <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, June 16, 2008 8:46:29 AM
> Subject: infinite loop-problem
> 
> Hi,
> 
> While crawling the website http://www.bmj.de, I suspect I am caught in an
> infinite loop. Nutch has been fetching for two days now and there seems to
> be no end.
> 
> I need every linked document from this website.
> 
> My configuration:
> 
> A. The crawl-urlfilter.txt:
> 
> 1. I removed the default line that breaks loops when a path segment repeats
> 3+ times. I think this is OK in my case and is not the cause of my problem
> (the rule I removed is quoted after this list).
> 
> 2. URLFilter is +^http://www.bmj.de/
> 
> 3. Command-line-option "nutch crawl .. -depth 10 -topN 10000"
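> 
> For reference, I believe the loop-breaking rule I removed is this default
> from crawl-urlfilter.txt:
> 
>   # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>   -.*(/[^/]+)/[^/]+\1/[^/]+\1/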
> 
> B. I set up the Nutch configuration to fetch first and to parse afterwards,
> in order to increase fetching speed.
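> 
> If it matters, I believe this amounts to setting fetcher.parse to false in
> nutch-site.xml, so the fetcher only fetches and parsing runs as a separate
> step afterwards:
> 
>   <property>
>     <name>fetcher.parse</name>
>     <value>false</value>
>   </property>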
> 
> Is it because of the session IDs and navigation strings in the URLs? They
> look like this:
> 
> http://www.bmj.de/enid/3323c15e419390ec405dcc561513c2d3,1489d6706d635f6964092d0935313835093a0979656172092d0932303038093a096d6f6e7468092d093035093a095f7472636964092d0935313835/Pressestelle/Pressemitteilungen_58.html
> 
> How can I deal with this?
> 
> I'm running Nutch/Solr as proposed by Doğacan Güney et al. in NUTCH-442
> (see https://issues.apache.org/jira/browse/NUTCH-442), with Tomcat 6 on
> Ubuntu 8.04.
> 
> Thanks
> 
> Felix.
