OK, I'll post it, but there is no problem when running outside Eclipse. Thanks for your interest.
-----Original Message-----
From: Christoph M. Pflügler [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 17, 2008 3:04 PM
To: [email protected]
Subject: RE: Eclipse-Crawl Problem

I just saw that you only changed the one line in urlfilter.txt you described,
so I suppose it still contains the "-." line. If so, try it without that
line; this might solve your problem.

Chris

On Thursday, 17.01.2008, at 14:20 +0200, Volkan Ebil wrote:
> Yes, I know how to start the crawl process. I have created the url txt file
> in the specified folder. The problem occurs in the Eclipse environment.
> Does anybody know something about my problem?
> Thanks.
>
> -----Original Message-----
> From: Christoph M. Pflügler
> [mailto:[EMAIL PROTECTED]
> Sent: Thursday, January 17, 2008 12:44 PM
> To: [email protected]
> Subject: Re: Eclipse-Crawl Problem
>
> Hey Volkan,
>
> did you specify any seed URLs in an arbitrary file in the folder you pass
> to nutch with the parameter -urls? This is necessary to give nutch some
> point(s) to start off with the crawl.
>
> Greets,
> Christoph
>
> On Thursday, 17.01.2008, at 12:27 +0200, Volkan Ebil wrote:
> > I configured Eclipse following the RunNutchInEclipse0.9 document, but
> > when I give the arguments to Eclipse and run the project, it reports
> > "No URLs to fetch - check your seed list and URL filters".
> > I have changed the line in crawl-urlfilter.txt
> > +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> > to
> > +.
> > as was suggested before, but it didn't solve my problem.
> > Thanks for your help.
> >
> > Volkan.
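For reference, a minimal seed setup along the lines Christoph describes above could look like this (the folder name "urls", the example seed URL, and the depth/topN values are only illustrative, not taken from this thread):

    urls/seed.txt:
        http://www.example.com/

    Program arguments for org.apache.nutch.crawl.Crawl in the Eclipse run
    configuration (as in the RunNutchInEclipse0.9 setup):
        urls -dir crawl -depth 3 -topN 50

The first argument is the folder containing the seed file(s); without at least one URL in there, the crawl has nothing to start from.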
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*sabah.com/

# accept only gov, edu, tr, mil, org, ...
#

# skip everything else
+.
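Regarding the "-." line Chris mentions: the rules in this file are applied top to bottom and the first matching pattern decides, so the order of the catch-all rules matters. A minimal sketch, assuming an otherwise default crawl-urlfilter.txt:

    # first match wins: "-." is reached first, so every URL is rejected
    # and the crawl stops with "No URLs to fetch"
    -.
    +.

    # "+." is reached first, so every URL not rejected by an earlier rule is accepted
    +.
    -.

The file posted above already ends with "+." and has no "-." line, so the filter itself should let the seed URLs through.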
