OK, I'll post it, but there is no problem when running outside Eclipse. Thanks for your interest.
-----Original Message-----
From: Christoph M. Pflügler [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 17, 2008 3:04 PM
To: [email protected]
Subject: RE: Eclipse-Crawl Problem

I just saw that you only changed the one line in urlfilter.txt you described,
so I suppose it still contains the "-." line. If so, try it without that
line; this might solve your problem.

Chris

On Thursday, 17.01.2008, at 14:20 +0200, Volkan Ebil wrote:
> Yes, I know how to start the crawl process. I have created the url txt file
> in the specified folder. The problem occurs in the Eclipse environment.
> Does anybody know something about my problem?
> Thanks.
>
> -----Original Message-----
> From: Christoph M. Pflügler
> [mailto:[EMAIL PROTECTED]
> Sent: Thursday, January 17, 2008 12:44 PM
> To: [email protected]
> Subject: Re: Eclipse-Crawl Problem
>
> Hey Volkan,
>
> did you specify any seed URLs in an arbitrary file in the folder you pass
> to nutch with the parameter -urls? This is necessary to give nutch some
> point(s) to start off with the crawl.
>
> Greets,
> Christoph
>
> On Thursday, 17.01.2008, at 12:27 +0200, Volkan Ebil wrote:
> > I configured Eclipse following the RunNutchInEclipse0.9 document, but
> > when I give the arguments to Eclipse and run the project, it reports
> > "No URLs to fetch - check your seed list and URL filters".
> > I have changed the line in crawl-urlfilter.txt
> > +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> > to
> > +.
> > as was suggested before, but it didn't solve my problem.
> > Thanks for your help.
> >
> > Volkan.
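For reference, a minimal seed setup along the lines Christoph describes above could look like this (the folder name "urls", the example seed URL, and the depth/topN values are only illustrative, not taken from this thread):

    urls/seed.txt:
        http://www.example.com/

    Program arguments for org.apache.nutch.crawl.Crawl in the Eclipse run
    configuration (as in the RunNutchInEclipse0.9 setup):
        urls -dir crawl -depth 3 -topN 50

The first argument is the folder containing the seed file(s); without at least one URL in there, the crawl has nothing to start from.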
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*sabah.com/

# accept only gov, edu, tr, mil, org, ...
#

# skip everything else
+.
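Regarding the "-." line Chris mentions: the rules in this file are applied top to bottom and the first matching pattern decides, so the order of the catch-all rules matters. A minimal sketch, assuming an otherwise default crawl-urlfilter.txt:

    # first match wins: "-." is reached first, so every URL is rejected
    # and the crawl stops with "No URLs to fetch"
    -.
    +.

    # "+." is reached first, so every URL not rejected by an earlier rule is accepted
    +.
    -.

The file posted above already ends with "+." and has no "-." line, so the filter itself should let the seed URLs through.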
