Hi, Yes, I tried both. It just does not crawl anything. I also tried http://www.cs.princeton.edu/~*, and it behaves the same as http://www.cs.princeton.edu, which is weird.
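
A possible explanation, sketched in Python rather than in the crawler itself. The assumptions here are that the filter tries its rules top to bottom, that the first regex found anywhere in the URL decides (+ accepts, - rejects), and that a URL matching no rule is dropped. The accepts() helper and the /course/x URL are hypothetical, for illustration only. Under those assumptions, "~*" means "zero or more tildes", so it accepts everything the plain prefix accepts, while a bare "~" rule rejects the starting page http://www.cs.princeton.edu/ itself, which might be why nothing gets crawled.

```python
import re


def accepts(rules, url):
    """Hypothetical stand-in for the URL filter: first matching rule wins."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is ignored


seed = "http://www.cs.princeton.edu/"            # starting page from the thread
home = "http://www.cs.princeton.edu/~abcd"       # a personal page
course = "http://www.cs.princeton.edu/course/x"  # hypothetical unwanted page

tilde_only = [r"+^http://www.cs.princeton.edu/~"]
tilde_star = [r"+^http://www.cs.princeton.edu/~*"]
no_tilde = [r"+^http://www.cs.princeton.edu/"]

for name, rules in [("~", tilde_only), ("~*", tilde_star), ("none", no_tilde)]:
    print(name, [accepts(rules, u) for u in (seed, home, course)])
# ~    [False, True, False]   the seed itself is filtered out
# ~*   [True, True, True]     same as having no tilde at all
# none [True, True, True]
```

If this is what is happening, the seed list would need to contain URLs that already pass the filter, or the filter would need an extra + rule for the pages that link to the personal pages.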
On Apr 7, 2005 3:10 PM, Hauck, William B. <[EMAIL PROTECTED]> wrote:
> Eric,
>
> Have you tried with _only_ the ~ instead of the whole ending? ...
> Turn this:
> +^http://www.cs.princeton.edu/~([a-z0-9]*\.//)*
> Into this:
> +^http://www.cs.princeton.edu/~
>
> Also, you might need to escape the ~ ...
> +^http://www.cs.princeton.edu/\~
>
> If it works, please reply back to the list with the answer.
>
> Good luck,
>
> bill
>
> -----Original Message-----
> From: Eric Money [mailto:[EMAIL PROTECTED]
> Sent: Thursday, April 07, 2005 2:59 PM
> To: Sébastien LE CALLONNEC
> Cc: [email protected]
> Subject: Re: Crawl-urlfiter problem
>
> Hi Sébastien:
>
> Thank you for your reply. I've also tried that way, but some pages
> that I don't need still get crawled. I wanna focus on
> http://www.cs.princeton.edu/~, and anything except that is not needed.
> Also, I took Princeton as an example, but I am also crawling other
> univs., like http://theory.lcs.mit.edu/~, so it is a general case and
> I wanna find a general solution. Thank you.
>
> On Apr 7, 2005 2:52 PM, Sébastien LE CALLONNEC <[EMAIL PROTECTED]> wrote:
> > Hi Eric,
> >
> > I might be wrong (and people here will answer properly if I am), but
> > the crawl-urlfilter file contains the urls _to be crawled_, not _to be
> > indexed_. A solution might be to keep your regexp, to add one to
> > allow the /people/ pages to be crawled, and to start the crawling at
> > those links:
> >
> > http://www.cs.princeton.edu/people/grad.php
> > http://www.cs.princeton.edu/people/ugrad.php
> > http://www.cs.princeton.edu/people/techstaff.php
> >
> > and any other that might contain links to personal pages.
> >
> > Hope this helps.
> >
> > Regards,
> > Sébastien.
> >
> >
> > --- Eric Money <[EMAIL PROTECTED]> wrote:
> > > I have a problem setting up the urlfilter. For example, I wanna crawl
> > > all student pages at http://www.cs.princeton.edu/, which end up at
> > > http://www.cs.princeton.edu/~abcd or sth like that. Thus I made the
> > > starting page http://www.cs.princeton.edu/ and set up the
> > > crawl-urlfilter as
> > >
> > > +^http://www.cs.princeton.edu/~([a-z0-9]*\.//)*
> > >
> > > But it just doesn't crawl anything.
> > > If I remove the "~", it does crawl well, but it also crawls many things
> > > that I don't need, like ../course/.... How should I set up the
> > > urlfilter properly? Thank you all.
