I followed the NutchTutorial and got search working, but I have
several questions.
1st, is it possible for a website to set up some restriction so that
Nutch cannot fetch its pages, or so that the pages fetched are limited
under some condition? If so, what file (e.g. robots.txt) would Nutch
respect in
Hi Thomas,
Firstly, which distribution are you using?
___
From: Thomas Anderson [t.dt.aander...@gmail.com]
Sent: 18 February 2011 10:11
To: user@nutch.apache.org
Subject: Nutch search result
I followed the NutchTutorial and got search working, but I have
several
Hi,
I want to separate parsing from crawling in Nutch. In other words, I want to
crawl thousands of URLs and save the contents on a local drive, and parse the
contents later, after crawling is completed.
What is the end point (Java class, line of code, etc.) for the crawling?
Thanks,
Jeff
Hi,
When I look through the fetched results, I find some URLs were fetched and
some weren't. How can I make sure that every URL is fetched?
Thanks,
Jeff
I'm not sure what you mean, but generating a segment and just fetching it
(possibly with the -noParse option, depending on your config) will just
download the URLs into the segment.
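
A rough sketch of the sequence described above, written as a small Python helper that only builds the command lines and does not run them. The crawldb path, segments directory and segment name are hypothetical placeholders. The no-parse flag is passed in by the caller because its exact spelling should be checked against the usage output of `bin/nutch fetch` for the installed version; this message calls it -noParse.

```python
from typing import List


def fetch_then_parse_commands(crawldb: str, segments_dir: str,
                              segment: str, no_parse_flag: str) -> List[List[str]]:
    """Build argv lists for: generate a segment, fetch it unparsed, parse it later.

    All paths are supplied by the caller. `segment` is the directory that the
    generate step creates under `segments_dir` (assumed to be known here).
    """
    return [
        # 1. Select URLs from the crawldb and write a new segment.
        ["bin/nutch", "generate", crawldb, segments_dir],
        # 2. Download the content only; the flag name is an assumption to verify.
        ["bin/nutch", "fetch", segment, no_parse_flag],
        # 3. Parse the stored content in a separate, later step.
        ["bin/nutch", "parse", segment],
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for argv in fetch_then_parse_commands(
            "crawl/crawldb", "crawl/segments",
            "crawl/segments/SEGMENT_NAME", "-noParse"):
        print(" ".join(argv))
```

Keeping the parse step as its own command is what lets the fetch finish for all URLs before any parsing starts.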
On Friday 18 February 2011 14:40:37 Jeff Zhou wrote:
Hi,
I want to separate parsing from crawling in Nutch. In
That's difficult. First of all, your url-filter may prevent some URLs from
being fetched. It can also happen that your parsers won't process all fetched
content types. There may also be network issues and 404, 30x and 50x responses.
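
To illustrate the url-filter point, here is a minimal Python sketch of how a regex-based URL filter can silently drop URLs. It assumes the usual convention of rule lines prefixed with `+` (accept) or `-` (reject), where the first matching rule decides and a URL that matches no rule is rejected. The rules and the example.org URLs are made up for illustration and are not taken from any real configuration.

```python
import re
from typing import List, Tuple

Rule = Tuple[bool, re.Pattern]

# Hypothetical rule set, for illustration only.
SAMPLE_RULES = r"""
# skip image and archive suffixes
-\.(gif|jpg|png|zip|gz)$
# skip URLs containing query-like characters
-[?*!@=]
# accept anything under example.org
+^http://([a-z0-9]*\.)*example\.org/
"""


def parse_rules(text: str) -> List[Rule]:
    """Turn '+regex' / '-regex' lines into (accept, compiled pattern) pairs."""
    rules: List[Rule] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules: List[Rule], url: str) -> bool:
    """The first rule whose pattern is found in the URL decides the outcome."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: no matching rule means the URL is dropped


if __name__ == "__main__":
    rules = parse_rules(SAMPLE_RULES)
    for url in ("http://www.example.org/page.html",
                "http://www.example.org/logo.png",
                "http://www.example.org/search?q=x",
                "http://other.example.com/"):
        print(url, accepts(rules, url))
```

Running candidate URLs through a check like this is one way to see whether a missing page was filtered out before the fetcher ever saw it.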
On Friday 18 February 2011 14:47:38 Jeff Zhou wrote:
Hi,
When I look
The version used is Nutch 1.1. The OS is Debian testing. The Java version is 1.6.0_23.
The first question arose while testing a fetch of plurk.com. The
URL specified at the inject stage contains only e.g. http://plurk.com.
After going through the steps described in the tutorial, I notice no
`fetching
2nd, after test-fetching several pages from Wikipedia, a search
query with e.g. bin/nutch org.apache.nutch.searcher.NutchBean apache
../wiki_dir returns
It returns a result for the keyword apache because that URL has apache in it.
-topN 50), it actually fetches some pages e.g. `fetching