Re: Crawl-urlfilter problem

Eric Money Thu, 07 Apr 2005 12:30:21 -0700

Hi,

Yes, I tried both. It still does not crawl anything.
I also tried http://www.cs.princeton.edu/~*, and
it behaves the same as http://www.cs.princeton.edu,
which is weird.


On Apr 7, 2005 3:10 PM, Hauck, William B. <[EMAIL PROTECTED]> wrote:
>  Eric,
> 
> Have you tried with _only_ the ~ instead of the whole ending? ...
> Turn this:
> +^http://www.cs.princeton.edu/~([a-z0-9]*\.//)*
> Into this:
> +^http://www.cs.princeton.edu/~
> 
> Also, you might need to escape the ~ ...
> +^http://www.cs.princeton.edu/\~
> 
> If it works, please reply back to the list with the answer.
> 
> Good luck,
> 
> bill
> 
> -----Original Message-----
> From: Eric Money [mailto:[EMAIL PROTECTED]
> Sent: Thursday, April 07, 2005 2:59 PM
> To: Sébastien LE CALLONNEC
> Cc: [email protected]
> Subject: Re: Crawl-urlfilter problem
> 
> Hi Sébastien:
> 
> Thank you for your reply. I've also tried that way, but some pages that 
> I don't need still get crawled. I want to focus on 
> http://www.cs.princeton.edu/~, and nothing outside of that is needed.
> Also, I took Princeton as an example, but I am crawling other universities 
> too, like http://theory.lcs.mit.edu/~, so this is a general case and I want 
> to find a general solution. Thank you.
> 
> On Apr 7, 2005 2:52 PM, Sébastien LE CALLONNEC <[EMAIL PROTECTED]> wrote:
> > Hi Eric,
> >
> > I might be wrong (and people here will answer properly if I am), but
> > the crawl-urlfilter file contains the urls _to be crawled_, not _to be
> > indexed_.  A solution might be to keep your regexp, to add one to
> > allow the /people/ pages to be crawled, and to start the crawling at
> > those links:
> >
> > http://www.cs.princeton.edu/people/grad.php
> > http://www.cs.princeton.edu/people/ugrad.php
> > http://www.cs.princeton.edu/people/techstaff.php
> >
> > and any other that might contain links to personal pages.
> >
> > Hope this helps.
> >
> > Regards,
> > Sébastien.
> >
> >
> > --- Eric Money <[EMAIL PROTECTED]> wrote:
> > > I have a problem setting up the urlfilter. For example, I want to crawl
> > > all student pages at http://www.cs.princeton.edu/, which have URLs
> > > like http://www.cs.princeton.edu/~abcd. So I made the
> > > starting page http://www.cs.princeton.edu/ and set up the
> > > crawl-urlfilter as
> > >
> > > +^http://www.cs.princeton.edu/~([a-z0-9]*\.//)*
> > >
> > > But it just doesn't crawl anything.
> > > If I remove the "~", it crawls fine, but it also crawls many things
> > > that I don't need, like ../course/.... How should I set up the
> > > urlfilter properly? Thank you all.
> > >
> >
> 
>
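
The behaviour described in this thread can be reproduced outside the crawler. The following is only a minimal sketch in Python, not Nutch code: make_filter is a hypothetical helper, and it assumes the filter tries its rules from top to bottom, lets the first matching rule decide, uses unanchored (substring) matching, and rejects a URL that no rule matches. Under those assumptions the original rule rejects the starting page itself, which would explain why nothing is crawled, and "~*" means "zero or more tildes", so it accepts the same URLs as the rule without the tilde.

```python
import re


def make_filter(rules):
    """Build an accept(url) function from '+regex' / '-regex' rule strings.

    Hypothetical helper for illustration only. Assumes first-match-wins,
    substring matching, and rejection when no rule matches.
    """
    compiled = [(rule[0] == "+", re.compile(rule[1:])) for rule in rules]

    def accept(url):
        for allow, pattern in compiled:
            if pattern.search(url):
                return allow
        return False  # no rule matched: the URL is ignored

    return accept


# The rule from the original message.
accept_original = make_filter(
    [r"+^http://www.cs.princeton.edu/~([a-z0-9]*\.//)*"]
)

# The "~*" variant: the tilde becomes optional, so every page on the host passes.
accept_tilde_star = make_filter([r"+^http://www.cs.princeton.edu/~*"])

# The suggestion from the thread: also allow the /people/ listing pages
# and start the crawl from them instead of from the site root.
accept_suggested = make_filter(
    [
        r"+^http://www\.cs\.princeton\.edu/people/",
        r"+^http://www\.cs\.princeton\.edu/~",
    ]
)

if __name__ == "__main__":
    urls = [
        "http://www.cs.princeton.edu/",
        "http://www.cs.princeton.edu/people/grad.php",
        "http://www.cs.princeton.edu/~abcd",
        "http://www.cs.princeton.edu/course/",
    ]
    for url in urls:
        print(url, accept_original(url), accept_tilde_star(url), accept_suggested(url))
```

With the third filter, the /people/ pages can serve as starting points: they pass the filter, their links to "~" pages pass as well, and pages such as ../course/ are dropped.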
