Nutch search result

2011-02-18 Thread Thomas Anderson
I followed the NutchTutorial and got search working, but I have several questions. 1st, is it possible for a website to set up some restriction so that Nutch cannot fetch its pages, or so that the pages fetched are limited under some condition? If so, what file (e.g. robots.txt?) would Nutch respect in

RE: Nutch search result

2011-02-18 Thread McGibbney, Lewis John
Hi Thomas, Firstly, which dist are you using? From: Thomas Anderson [t.dt.aander...@gmail.com] Sent: 18 February 2011 10:11 To: user@nutch.apache.org Subject: Nutch search result I follow the NutchTutorial and get the search worked, but I have several

What is the end point of a pure crawl?

2011-02-18 Thread Jeff Zhou
Hi, I want to separate parsing from crawling in Nutch. In other words, I want to crawl thousands of URLs and save the contents to a local drive, then parse the contents later, after crawling is completed. What is the end point (Java class, line of code, etc.) for the crawling? Thanks, Jeff

Why some links aren't fetched?

2011-02-18 Thread Jeff Zhou
Hi, When I look through the fetched results, I find some URLs were fetched and some weren't. How can I make sure that every URL is fetched? Thanks, Jeff

Re: What is the end point of a pure crawl?

2011-02-18 Thread Markus Jelsma
I'm not sure what you mean, but generating a segment and just fetching it (possibly with the -noParse option, depending on your config) will just download the URLs into the segment. On Friday 18 February 2011 14:40:37 Jeff Zhou wrote: Hi, I want to separate parsing from crawling in Nutch. In

Re: Why some links aren't fetched?

2011-02-18 Thread Markus Jelsma
That's difficult. First of all, your url-filter may prevent some URLs from being fetched. It can also happen that your parsers won't process all fetched content types. There may also be network issues, or HTTP 404, 30x and 50x responses. On Friday 18 February 2011 14:47:38 Jeff Zhou wrote: Hi, When I look
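
To illustrate the url-filter point in the reply above, here is a minimal sketch in Python (not Nutch code) of how a first-match regex URL filter can silently drop URLs before they are ever fetched. It is loosely modelled on the `+`/`-` rule style of a regex URL filter file; the rule set, the `example.org` host, and the `parse_rules` and `accepts` helpers are all hypothetical and only meant to show the idea.

```python
import re

# Hypothetical rules: '+' accepts, '-' rejects, first matching pattern decides.
RULES = r"""
# skip common binary and style suffixes
-\.(gif|jpg|png|css|zip|exe)$
# skip URLs containing characters typical of queries
-[?*!@=]
# accept anything on example.org and its subdomains (assumed host)
+^http://([a-z0-9]*\.)*example\.org/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn rule lines into (allow, compiled_pattern) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in the URL."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # no rule matched: treat the URL as rejected


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "http://www.example.org/page.html",
        "http://www.example.org/logo.png",
        "http://www.example.org/search?q=nutch",
        "http://other.com/",
    ):
        print(url, "->", "fetch" if accepts(rules, url) else "filtered out")
```

Under these assumed rules, only the first URL passes; the image, the query URL, and the off-site URL are filtered out, which is one way a crawl can end up with fewer fetched pages than expected.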

Re: Nutch search result

2011-02-18 Thread Thomas Anderson
The version used is Nutch 1.1. The OS is Debian testing. The Java version is 1.6.0_23. The first question arose when testing a fetch of plurk.com. The URL specified at the inject stage only contains e.g. http://plurk.com. After going through the steps described in the tutorial, I notice no `fetching

Re: Nutch search result

2011-02-18 Thread alxsss
2nd, after testing to fetch several pages from wikipedia, the search query with e.g. bin/nutch org.apache.nutch.searcher.NutchBean apache ../wiki_dir returns It returns a result for the keyword apache because that URL has apache in it. -topN 50), it actually fetches some pages e.g. `fetching