Hi all,

I am new to Nutch and I have some questions about the "crawl" command and seed
URLs.

When I run the crawl command a second time (the "/crawl" folder already exists
from the first run, which was done on a different day), the generated segments
do not seem to contain the updated information from the seed URLs (I ran a
search against them). But when I delete the "/crawl" folder and run the crawl
command again, the generated segments do include the updated information from
the seed URLs. I have studied many documents on Nutch but I am still not able
to understand how the crawl command works. Can someone help me with this? I
would appreciate it.
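
For reference, this is roughly how I invoke the crawl (in Eclipse I run the
org.apache.nutch.crawl.Crawl class with the equivalent arguments; the depth
and topN values are just from my test setup):

  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000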

Besides this, I would like to ask about the seed URLs we provide: must they be
the root of the website (e.g. www.mypage.com), or can I put something like
www.mypage.com/comment? When I used a seed URL that is not the root page and
ran the crawl command, many of the links in www.mypage.com/comment were not
fetched (I checked the log file), and when I searched for information which I
am sure is on that page (www.mypage.com/comment), it returned zero results.
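
To make it concrete, my seed file looks something like this (one URL per line;
www.mypage.com stands in for my real domain):

  urls/seed.txt:
  http://www.mypage.com/comment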

Actually, I am running Nutch 1.0 in the Eclipse environment, and I edited
crawl-urlfilter.txt as follows (I commented out the loop-breaking rule):

  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]

  # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
  #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
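
For completeness, the accept rule at the bottom of my crawl-urlfilter.txt
follows the tutorial template, with mypage.com standing in for my real domain:

  # accept hosts in mypage.com
  +^http://([a-z0-9]*\.)*mypage.com/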

Can someone give me some guidance on this?

Thank you so much.

Warm regards,
Kim
