Bipin Parmar wrote:
Jin,

Is it your intent to get the URL list only? If it is
just one website, you can crawl it with Nutch. Look
at the "Intranet: Running the Crawl" tutorial at
http://lucene.apache.org/nutch/tutorial8.html. Use a
very high number for depth, like 10. Once the crawl
is complete, you can extract all the URLs from the
crawldb using the nutch readdb command and grep.
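
For example, something along these lines (a rough
sketch against a 0.8-era Nutch; "crawldump" is an
arbitrary output directory, and the part-00000 file
name comes from Hadoop's output layout):

bin/nutch readdb crawl/crawldb -dump crawldump
grep '^http' crawldump/part-00000

Each matching line of the dump starts with a
crawled URL.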

If your intent is to crawl every page on a website,
the process is the same: just use a high depth
value. Once the crawl is complete, you can search
and view the crawled content through the search
pages. The tutorial describes the search setup.
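
A typical invocation, following the tutorial's
layout (urls holds the seed list, crawl is the
output directory; the depth value is just an
example):

bin/nutch crawl urls -dir crawl -depth 10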

I hope I understood your question correctly.

Bipin

--- Jin Yang <[EMAIL PROTECTED]> wrote:

How do I generate the URL list of a website? Should
we put them in one by one? Like this?

www.apache.org/1.html
www.apache.org/2.html
www.apache.org/3.html

Is there any tool or command that can do this?
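
If the pages really are numbered sequentially like
that, a simple shell loop could generate such a
list (a sketch; the range is made up):

for i in $(seq 1 100); do echo "http://www.apache.org/$i.html"; done > urls/list

Normally, though, you would seed the crawler with
the site's front page and let it discover the links
itself, as described above.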




The intranet crawl doesn't work; what could be the problem? I used the readdb command to check the crawldb folder, but it doesn't show any statistics or URLs that have been crawled.

I have created a file urls/nutch containing:

http://lucene.apache.org/nutch/

edited conf/crawl-urlfilter.txt and added:

+^http://([a-z0-9]*\.)*apache.org/

and set conf/nutch-site.xml with:

<property>
 <name>http.agent.name</name>
 <value>user agent</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization.

 NOTE: You should also check other related properties:

        http.robots.agents
        http.agent.description
        http.agent.url
        http.agent.email
        http.agent.version

 and set their values appropriately.

 </description>
</property>

<property>
 <name>http.agent.description</name>
 <value>nutch tutorial</value>
 <description>Further description of our bot- this text is used in
 the User-Agent header.  It appears in parenthesis after the agent name.
 </description>
</property>

<property>
 <name>http.agent.url</name>
 <value>apache.org</value>
<description>A URL to advertise in the User-Agent header. This will appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
 </description>
</property>

<property>
 <name>http.agent.email</name>
 <value>[EMAIL PROTECTED]</value>
 <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
 </description>
</property>

and ran the command:

bin/nutch crawl urls -dir crawl -depth 10 -topN 50

and checked with:

bin/nutch readdb crawl/crawldb -stats

What am I doing wrong?
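
One way to narrow this down is to ask the crawldb
about the seed URL directly (assuming this Nutch
version's readdb supports the -url option):

bin/nutch readdb crawl/crawldb -url http://lucene.apache.org/nutch/

If no entry comes back, the seed was never injected,
which usually points at the crawl-urlfilter.txt
rules or at the seed file location.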
