Hi Sébastien,

Thank you for your reply. I've also tried that approach, but it still
crawls some pages that I don't need. I want to focus on
http://www.cs.princeton.edu/~, and nothing outside of that is needed.
Also, I used Princeton as an example, but I am crawling other universities
too, like http://theory.lcs.mit.edu/~, so this is a general case and I want
to find a general solution. Thank you.

On Apr 7, 2005 2:52 PM, Sébastien LE CALLONNEC <[EMAIL PROTECTED]> wrote:
> Hi Eric,
> 
> I might be wrong (and people here will answer properly if I am), but
> the crawl-urlfilter file contains the urls _to be crawled_, not _to be
> indexed_.  A solution might be to keep your regexp, to add one to allow
> the /people/ pages to be crawled, and to start the crawling at those
> links:
> 
> http://www.cs.princeton.edu/people/grad.php
> http://www.cs.princeton.edu/people/ugrad.php
> http://www.cs.princeton.edu/people/techstaff.php
> 
> and any other that might contain links to personal pages.
> 
> Hope this helps.
> 
> Regards,
> Sébastien.
> 
> 
> --- Eric Money <[EMAIL PROTECTED]> wrote:
> > I have a problem setting up the urlfilter. For example,
> > I want to crawl all student pages at http://www.cs.princeton.edu/,
> > which have URLs like http://www.cs.princeton.edu/~abcd
> > or something like that. So I made the starting page
> > http://www.cs.princeton.edu/
> > and set up the crawl-urlfilter as
> >
> > +^http://www.cs.princeton.edu/~([a-z0-9]*\.//)*
> >
> > But it just doesn't crawl anything.
> > If I remove the "~", it does crawl, but it also crawls
> > many things that I don't need, like ../course/....
> > How should I set up the urlfilter properly? Thank you all.
> >
>
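
A rough sketch of how the rules discussed in this thread might be checked
before running a crawl, written in plain Python rather than against Nutch
itself. It assumes the filter tries its rules in order, that the first
pattern found anywhere in a URL decides, and that a URL matching no rule is
rejected. SUGGESTED_RULES and accepts() are hypothetical names made up for
this illustration, and the "abcd" and "someone" URLs are placeholders.

```python
import re

# The rule from the original question, copied unchanged.
ORIGINAL_RULES = [
    r"+^http://www.cs.princeton.edu/~([a-z0-9]*\.//)*",
]

# Hypothetical rule set following the suggestion in the thread: allow the
# seed page, the /people/ listing pages, and the ~user pages, then reject
# everything else. One extra line covers the second host.
SUGGESTED_RULES = [
    r"+^http://www\.cs\.princeton\.edu/$",
    r"+^http://www\.cs\.princeton\.edu/people/",
    r"+^http://www\.cs\.princeton\.edu/~",
    r"+^http://theory\.lcs\.mit\.edu/~",
    r"-.",
]


def accepts(rules, url):
    """Return True if the first rule whose pattern is found in url starts with '+'.

    Assumed semantics: rules are tried in order, and no match means reject.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    samples = [
        "http://www.cs.princeton.edu/",
        "http://www.cs.princeton.edu/people/grad.php",
        "http://www.cs.princeton.edu/~abcd/",
        "http://www.cs.princeton.edu/courses/",
        "http://theory.lcs.mit.edu/~someone/",
    ]
    for url in samples:
        print(url, accepts(ORIGINAL_RULES, url), accepts(SUGGESTED_RULES, url))
```

Under these assumptions the original rule rejects the starting page
http://www.cs.princeton.edu/ itself, which may be why nothing gets crawled:
the crawler never reaches a page that links to a ~ URL. For other
universities, the same kinds of line (seed page, listing pages, ~ pages)
would presumably be repeated per host, with the final "-." line kept last.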
