Hi,
while crawling the website http://www.bmj.de, I suspect I am caught in an infinite loop. Nutch has been fetching for two days now and there seems to be no end. I need every linked document from this website.

My configuration:

A. The crawl-urlfilter.txt:
   1. I removed the line that breaks loops when a slash-delimited segment repeats 3+ times. I think this is OK in my case and is not the cause of my problem.
   2. The URL filter is +^http://www.bmj.de/
   3. Command-line options: "nutch crawl .. -depth 10 -topN 10000"

B. I set up the Nutch configuration to fetch first and parse afterwards in order to increase fetching speed.

Is the problem caused by the session IDs and navigation strings in the URLs? They look like this:

http://www.bmj.de/enid/3323c15e419390ec405dcc561513c2d3,1489d6706d635f6964092d0935313835093a0979656172092d0932303038093a096d6f6e7468092d093035093a095f7472636964092d0935313835/Pressestelle/Pressemitteilungen_58.html

How can I deal with this?

I'm running Nutch/Solr as proposed by Doğacan Güney et al. in NUTCH-442 (https://issues.apache.org/jira/browse/NUTCH-442), with Tomcat 6 on Ubuntu 8.04.

Thanks,
Felix
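
P.S. One idea I had, but have not tried yet, is to strip the session part out of the URLs with the urlnormalizer-regex plugin, so that all the session variants collapse to one URL instead of spawning an endless crawl. Below is a rough sketch of an entry for conf/regex-normalize.xml; the pattern is only my guess at the session-ID format based on the example URL above, and I don't know whether the pages still resolve once that segment is removed (urlnormalizer-regex would also have to be enabled in plugin.includes):

  <!-- Collapse the ",<token>" part of the /enid/ segment that looks like a
       session ID / navigation string, so the same page is not fetched under
       many different URLs. The pattern is my assumption from the example
       URL above, not something I have verified against the site. -->
  <regex>
    <pattern>/enid/([0-9a-f]{32}),[0-9a-f]+/</pattern>
    <substitution>/enid/$1/</substitution>
  </regex>

Would that be the right place to hook this in, or is there a better way?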
