Hi Sébastien,

Thank you for your reply. I have tried that approach too, but it still crawls some pages that I don't need. I want to restrict the crawl to URLs under http://www.cs.princeton.edu/~ and skip everything else. Princeton is only an example: I am also crawling other universities, such as http://theory.lcs.mit.edu/~, so I am looking for a general solution. Thank you.
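
To make the goal concrete, here is a small Python sketch of the matching behaviour I am after. It is not Nutch code. It only mimics my understanding of the filter file, where each line is a "+" or "-" followed by a regex and the first matching line decides. The rule list, the character class for user names, and the helper name `accepts` are my own assumptions for illustration.

```python
import re

# Hypothetical rules in the "+regex / -regex" style of crawl-urlfilter.
# The two hosts are the examples from this thread; the user-name
# character class is an assumption about what follows the "~".
RULES = [
    ("+", r"^http://www\.cs\.princeton\.edu/~[A-Za-z0-9_.-]+(/|$)"),
    ("+", r"^http://theory\.lcs\.mit\.edu/~[A-Za-z0-9_.-]+(/|$)"),
    ("-", r"."),  # reject everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose regex is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched at all


if __name__ == "__main__":
    for url in (
        "http://www.cs.princeton.edu/~abcd",
        "http://www.cs.princeton.edu/courses/",
        "http://www.cs.princeton.edu/",
    ):
        print(url, accepts(url))
```

Note that under these rules the start page http://www.cs.princeton.edu/ is itself rejected, which may be why nothing is fetched when I start the crawl there.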
On Apr 7, 2005 2:52 PM, Sébastien LE CALLONNEC <[EMAIL PROTECTED]> wrote:
> Hi Eric,
>
> I might be wrong (and people here will answer properly if I am), but
> the crawl-urlfilter file contains the urls _to be crawled_, not _to be
> indexed_. A solution might be to keep your regexp, to add one to allow
> the /people/ pages to be crawled, and to start the crawling at those
> links:
>
> http://www.cs.princeton.edu/people/grad.php
> http://www.cs.princeton.edu/people/ugrad.php
> http://www.cs.princeton.edu/people/techstaff.php
>
> and any other that might contain links to personal pages.
>
> Hope this helps.
>
> Regards,
> Sébastien.
>
>
> --- Eric Money <[EMAIL PROTECTED]> wrote:
> > I have problem setting up the urlfilter. For example,
> > I wanna crawl all student pages at http://www.cs.princeton.edu/
> > which ends up with http://www.cs.princeton/edu/~abcd
> > sth like that. Thus I made the starting page
> > http://www.cs.princeton.edu/
> > and set up the crawl-urlfilter as
> >
> > +^http://www.cs.princeton.edu/~([a-z0-9]*\.//)*
> >
> > But it just doesn't crawl anything,
> > if I remove the "~", it does crawl well, but also crawl
> > many things that I don't need, like ../course/....,
> > how should I set up the urlfilter properly? Thank you all.
