Bipin Parmar wrote:
Jin,
Is it your intent to get only the URL list? If it is
just one website, you can crawl it with Nutch. Look at
the "Intranet: Running the Crawl" section of the
tutorial at
http://lucene.apache.org/nutch/tutorial8.html. Use a
fairly high number for depth, like 10. Once the crawl
is complete, you can extract all the URLs from the
crawldb using the nutch readdb command and grep.
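For example, roughly (the dump directory name here is arbitrary, and
the exact dump format can vary between Nutch versions, so treat this
as a sketch):

  # crawl as in the tutorial, then dump the crawldb as plain text
  bin/nutch crawl urls -dir crawl -depth 10
  bin/nutch readdb crawl/crawldb -dump crawldb-dump

  # the first line of each dumped record starts with its URL
  cat crawldb-dump/part-* | grep '^http' | cut -f1 | sort -u > urls.txt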
If your intent is to crawl every page on a website, the
process is the same; just use a high depth value. Once
the crawl is complete, you can search and view the
crawled content through the search pages. The tutorial
describes the search setup.
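From memory, the search setup in that tutorial looks roughly like this
(the war file name and Tomcat paths below are assumptions for your
install, so verify against the tutorial):

  # deploy the Nutch web app as Tomcat's ROOT application
  rm -rf $CATALINA_HOME/webapps/ROOT*
  cp nutch-0.8.war $CATALINA_HOME/webapps/ROOT.war

  # after Tomcat unpacks the war, point the searcher.dir property in
  # $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/nutch-site.xml at your
  # crawl directory, then restart Tomcat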
I hope I understood your question correctly.
Bipin
--- Jin Yang <[EMAIL PROTECTED]> wrote:
How do I generate the URL list for a website? Should we
put the URLs in one by one, like this?
www.apache.org/1.html
www.apache.org/2.html
www.apache.org/3.html
Is there any tool or command that can do this?
The intranet crawling doesn't work; what could be the problem? I used
the readdb command to check the crawldb folder, but it doesn't show any
statistics or URLs that have been crawled.
I have created a file urls/nutch containing:
http://lucene.apache.org/nutch/
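A minimal way to create that seed file, using the names above:

  mkdir urls
  echo "http://lucene.apache.org/nutch/" > urls/nutch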
I then edited conf/crawl-urlfilter.txt, adding:
+^http://([a-z0-9]*\.)*apache.org/
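In the stock crawl-urlfilter.txt this line replaces the MY.DOMAIN.NAME
placeholder and has to come before the final catch-all rule, so the
relevant portion should look roughly like this (paraphrasing the
default file; verify against your copy):

  # accept hosts in apache.org
  +^http://([a-z0-9]*\.)*apache.org/

  # skip everything else
  -.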
I also set conf/nutch-site.xml with:
<property>
  <name>http.agent.name</name>
  <value>user agent</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version
  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>nutch tutorial</value>
  <description>Further description of our bot; this text is used in
  the User-Agent header. It appears in parentheses after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>apache.org</value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parentheses after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>[EMAIL PROTECTED]</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
and ran the command:
bin/nutch crawl urls -dir crawl -depth 10 -topN 50
and checked with:
bin/nutch readdb crawl/crawldb -stats
What did I do wrong?