Hi all, I am new to Nutch and I have some questions about the "crawl" command and seed URLs.
When I run the crawl command a second time (the "/crawl" folder already exists from the first run, which was done on a different day), the generated segments do not seem to contain the updated information from the seed URLs (I ran a search against them). However, when I delete the "/crawl" folder and run the crawl command again, the generated segments do include the updated information from the seed URLs. I have studied many documents on Nutch, but I still cannot understand how the crawl command works. Can someone help me with this? I would appreciate it.

Besides this, I would like to ask whether the seed URLs we provide must be the root of the site (e.g. www.mypage.com), or whether I can put something like www.mypage.com/comment. When I used a seed URL that is not the root page and ran the crawl command, many of the links on www.mypage.com/comment were not fetched (I checked the log file), and when I search for information that I am sure is on that page (www.mypage.com/comment), it returns zero results.

I am actually running Nutch 1.0 in an Eclipse environment, and I edited crawl-urlfilter.txt as follows:

# skip URLs containing certain characters as probable queries, etc.
-...@]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
(I commented this line out.)

Can someone give me some guidance on this? Thank you so much.

Warm regards,
Kim
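P.S. For reference, this is roughly how I invoke the crawl from the command line; the -depth, -topN, and -threads values below are just the ones I happen to use, not anything special:

    bin/nutch crawl urls -dir crawl -depth 3 -topN 50 -threads 10

where "urls" is the directory containing my seed list and "crawl" is the output folder I mentioned above.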