This is what we have; hope it clears up some confusion. It will show up in the log files of the sites that you crawl like this. I don't know whether the configuration is what is causing your problem, but I have talked to other people on the list with similar problems whose configuration was incorrect. I think the only thing that is truly "required" is that http.agent.name not be blank, but I would set all of the other options as well, just for politeness.
Dennis

The log file will record a crawler entry similar to this:

NameOfAgent/1.0_(Yourwebsite.com;_http://www.yoururl.com/bot.html;[EMAIL PROTECTED])

<!-- HTTP properties -->
<property>
  <name>http.agent.name</name>
  <value>NameOfAgent</value>
  <description>Our HTTP 'User-Agent' request header.</description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>NutchCVS,Nutch,NameOfAgent,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence.</description>
</property>

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Yourwebsite.com</value>
  <description>Further description of our bot; this text is used in
  the User-Agent header. It appears in parentheses after the agent
  name.</description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://yoururl.com</value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parentheses after the agent name.</description>
</property>

<property>
  <name>http.agent.email</name>
  <value>[EMAIL PROTECTED]</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header.</description>
</property>

<property>
  <name>http.agent.version</name>
  <value>1.0</value>
  <description>A version string to advertise in the User-Agent header.</description>
</property>

(These overrides belong in conf/nutch-site.xml rather than conf/nutch-default.xml; see the sketch after the quoted messages below.)

carmmello wrote:
> Thanks for your answer, Dennis, but yes, I did. The only thing I did
> not do (and I have some doubt about it) is that for http.agent.version
> I only used the name Nutch-0.8.1, not the name I used in
> http.robots.agents, although in this configuration I have kept the *.
> Also, in the log file I cannot find any error regarding this.
>
> ----- Original Message ----- From: "Dennis Kubes"
> <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Wednesday, September 27, 2006 7:59 PM
> Subject: Re: no results in nutch 0.8.1
>
>
>> Did you set up the user agent name in the nutch-site.xml file or the
>> nutch-default.xml file?
>>
>> Dennis
>>
>> carmmello wrote:
>>> I have followed the steps in the 0.8.1 tutorial, and I have also
>>> been using Nutch for some time now without seeing the kind of
>>> problem I am encountering now.
>>> After I have finished the crawl process (intranet crawling), I go to
>>> localhost:8080, try a search and get, no matter what, 0 results.
>>> Looking at the logs, everything seems OK. Also, if I use the
>>> command bin/nutch readdb "crawl/crawldb" I find more than 6000 URLs.
>>> So, why can't I get any results?
>>> Thanks
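For reference, here is a minimal sketch of what a conf/nutch-site.xml carrying these overrides might look like. The agent name, description, URL, and email below are placeholders to replace with your own values; any property set in this file takes precedence over the same property in conf/nutch-default.xml.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>

  <property>
    <name>http.agent.name</name>
    <value>NameOfAgent</value>
    <description>Our HTTP 'User-Agent' request header.</description>
  </property>

  <property>
    <name>http.agent.description</name>
    <value>Yourwebsite.com</value>
    <description>Bot description shown in parentheses after the agent name.</description>
  </property>

  <property>
    <name>http.agent.url</name>
    <value>http://www.yoururl.com/bot.html</value>
    <description>A URL to advertise in the User-Agent header.</description>
  </property>

  <property>
    <name>http.agent.email</name>
    <value>[EMAIL PROTECTED]</value>
    <description>An email address for the HTTP 'From' and User-Agent headers.</description>
  </property>

  <property>
    <name>http.agent.version</name>
    <value>1.0</value>
    <description>A version string to advertise in the User-Agent header.</description>
  </property>

</configuration>

Keeping the overrides in nutch-site.xml rather than editing nutch-default.xml directly means they survive upgrades, and it is the file Dennis asks about in the quoted exchange above.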
