Similarly, if you have a monster CrawlDb and you use the topN approach, you could 
change Generator to generate M fetchlists at once, each with N URLs.  The URLs will 
still be ordered by score, or at least by the score known at the time of the 
Generator run (i.e. before CrawlDb is updated with data from subsequent 
fetching).  This way you'll waste less time in the generate step.  Patches for this 
would be great! :)
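
To make the idea concrete, here is a rough sketch of the partitioning only.  It is 
not Nutch's Generator code: ScoredUrl, FetchlistSketch and partition() are 
hypothetical names, and the CrawlDb is assumed to be a plain in-memory list 
rather than the on-disk structure Nutch actually uses.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class FetchlistSketch {

    /** Hypothetical stand-in for a CrawlDb entry: a URL and its current score. */
    public static class ScoredUrl {
        public final String url;
        public final double score;

        public ScoredUrl(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    /**
     * Sorts the entries once by descending score, then cuts the result into
     * at most m fetchlists of at most n URLs each.  Scores are whatever was
     * known at the time of this single pass.
     */
    public static List<List<ScoredUrl>> partition(List<ScoredUrl> entries, int m, int n) {
        List<ScoredUrl> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingDouble((ScoredUrl e) -> e.score).reversed());

        List<List<ScoredUrl>> fetchlists = new ArrayList<>();
        for (int i = 0; i < m; i++) {
            int from = i * n;
            if (from >= sorted.size()) {
                break; // fewer than m * n URLs available
            }
            int to = Math.min(from + n, sorted.size());
            fetchlists.add(new ArrayList<>(sorted.subList(from, to)));
        }
        return fetchlists;
    }
}
```

The first fetchlist holds the N best-scored URLs, the second the next N, and so 
on, so one sort serves all M lists instead of one generate run per list.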

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Wednesday, May 7, 2008 10:24:37 AM
> Subject: Re: How to authenticate with cookies?
> 
> Heh, I think this is another good use-case for HostDB, which doesn't yet 
> exist.  If this existed, we could store a cookie for each host in HostDB, and 
> include it in CrawlDatum entries used in Fetcher(2).  You'd have to dig down to 
> o.a.n.protocol.httpclient.Http and add cookies to the request there, I 
> believe.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> ----- Original Message ----
> > From: Yoav Shapira 
> > To: [email protected]
> > Sent: Wednesday, May 7, 2008 9:37:01 AM
> > Subject: Re: How to authenticate with cookies?
> > 
> > On Tue, May 6, 2008 at 10:47 PM, Duan, Niu wrote:
> > > Looks like Nutch doesn't support form-based authentication out of the
> > > box.  You may have to create your own httpclient or modify it for dealing
> > > with form-based authentication.  Form-based authentication requires
> > > dedicated input parameters (j_username, j_password) to be placed in the
> > > initial request message sent to the server.  Once authenticated, a cookie
> > > named jsessionid is going to be used to track the user session.
> > 
> > Thank you Nick.
> > 
> > What I'm actually looking for is a little different.  My server uses a
> > custom cookie name and value to indicate an authenticated user.  I
> > have this cookie (a valid version thereof, and let's assume for now
> > I've gotten past expiration issues) in a text file.
> > 
> > How do I tell Nutch's crawler to include a cookie name and value with
> > each HTTP request?
> > 
> > Yoav
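
For reference, attaching a fixed cookie to a request boils down to setting a 
Cookie header.  The sketch below uses only the JDK's HttpURLConnection, not any 
Nutch API; CookieRequestSketch and its method names are hypothetical.  In Nutch 
the equivalent change would presumably go into the HTTP protocol plugin 
mentioned above.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class CookieRequestSketch {

    /** Builds the value of a Cookie request header from one name/value pair. */
    public static String cookieHeader(String name, String value) {
        // Trim whitespace that may come along when the pair is read from a text file.
        return name.trim() + "=" + value.trim();
    }

    /**
     * Prepares a connection that will carry the cookie.  Nothing is sent
     * until the caller reads the response (e.g. via getInputStream()).
     */
    public static HttpURLConnection openWithCookie(URL url, String name, String value)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Cookie", cookieHeader(name, value));
        return conn;
    }
}
```

Calling openWithCookie() for every URL fetched would send the same name and 
value each time; expiry and per-host handling are left out here.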
