Jin,

Is it your intent to get the URL list only? If it is just one website, you can crawl it using Nutch. Look at the "Intranet: Running the Crawl" section of the tutorial at http://lucene.apache.org/nutch/tutorial8.html. Use a fairly high number for depth, like 10. Once the crawl is complete, you can extract all the URLs from the crawldb using the nutch readdb command and grep.
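For example, something along these lines should work with the 0.8 command-line tools. This is only a sketch: the urls/ seed directory, the crawl/ output directory, and the dump file names are just example names, you still need to set the include pattern for your site in conf/crawl-urlfilter.txt as the tutorial explains, and the exact layout of the readdb dump may differ in your version:

  # seed the crawl with the site's front page
  mkdir urls
  echo "http://www.apache.org/" > urls/seed.txt

  # crawl the whole site with a high depth
  bin/nutch crawl urls -dir crawl -depth 10 -topN 50000

  # dump the crawldb as text, then pull out just the URLs
  bin/nutch readdb crawl/crawldb -dump crawldb_dump
  grep -h "^http" crawldb_dump/part-* | cut -f 1 | sort -u > all_urls.txt

The last step works because each record in the text dump starts with the URL on its own line; adjust the grep/cut if your dump looks different.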
If your intent is to crawl every page on a website, the process is the same; just use a high depth value. Once the crawl is complete, you can search and view the crawled content through the search pages. The tutorial describes the search setup as well.

I hope I understood your question correctly.

Bipin

--- Jin Yang <[EMAIL PROTECTED]> wrote:

> How to generate the URL list of a website? Should
> we put them in one by one? Like this?
>
> www.apache.org/1.html
> www.apache.org/2.html
> www.apache.org/3.html
>
> Is there any tool or command that can do this?
