Hi Piotr,

 Thanks for the information. 

You are right, those URLs (generated with -refetchonly) are not being
fetched. In my bullet #4 I said that they are fetched, as I was misled by
the presence of data files (even though they were very small and I didn't
check their content).

 I'm trying to understand how to start with an initial set of URLs and then
keep fetching newly discovered URLs and re-fetching existing URLs (when they
are due for a re-fetch).
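
In case it helps, this is roughly the sequence of commands I run in a loop
(the segment directory name is just a placeholder for whatever the last
generate run created):

   bin/nutch fetch segments/<latest-segment>
   bin/nutch updatedb db segments/<latest-segment>
   bin/nutch generate db segments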

I will also post the questions below to the nutch-dev list.

 
   1. I have set db.default.fetch.interval to 1 (in nutch-default.xml), but 
   I have noticed that the fetchInterval field of the Page object is being set 
   to current time + 7 days while the URL/link data is read from the fetchlist. 
   Can somebody explain why, or am I not reading the code correctly? 
   2. I have modified the code to ignore the fetchInterval value coming from 
   the fetchlist, so that fetchInterval stays equal to the initial value 
   (current time). After I run the following commands: fetch, updatedb and 
   generate db segments, I get a new fetchlist, but it doesn't include my 
   original sites, even though their next fetch time should already be in the 
   past. Can somebody help me understand when those URLs will be fetched? 
   3. It looks like the fetcher fails to extract links from 
   http://www.eltweb.com. I know that some formats (and apparently some HTML 
   variations as well) are not supported. Where can I find information on what 
   is currently supported? 
   4. Some of the out-links discovered during the fetch (for instance 
   http://www.webct.com/software/viewpage?name=software_campus_edition or 
   http://v.extreme-dm.com/?login=cguilfor) are being ignored (not included 
   in the next fetchlist after running the [generate db segments] command). 
   Is there a known reason for this? Is there any documentation describing 
   the supported URL types? (See the URL filter excerpt just below this list.) 
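
Regarding #4: I notice that the default URL filter files shipped with Nutch
(conf/crawl-urlfilter.txt and conf/regex-urlfilter.txt in my checkout) still
contain a rule that skips URLs with query-like characters, and both URLs
above contain '?' and '='. I am not sure this is the filter actually applied
during generate, so please correct me if I am looking at the wrong place:

   # skip URLs containing certain characters as probable queries, etc.
   -[?*!@=]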

Thanks,

Daniel


On 6/8/05, Piotr Kosiorowski <[EMAIL PROTECTED]> wrote: 
> 
> Hello Daniel,
> I raised the -refetchonly question on the nutch-dev list two days ago
> (subject: -refetchonly investigation). I have described my tests and code
> findings there. If you are interested you can check it there, but for me
> the most important part is Doug's answer, so I will cite it here:
> <cite>
> The original rationale for the "-refetchonly" option was to permit
> indexing of all of the urls known to the database, with anchor text,
> but without fetching them. Thus one can, e.g., provide an index of 10M
> urls while only actually fetching 1M urls. I have never actually used
> this feature myself. I don't know whether other folks have ever used
> it successfully, nor whether such a feature is in fact desired.
> </cite>
> 
> I do not personally find such a feature useful, but maybe it is for
> somebody. I would like to add a feature that allows one to generate a
> fetchlist containing only urls that were already fetched (and, for
> symmetry, the opposite - urls that were never fetched) - but at the
> moment I am a bit busy with my personal life and work. I have it on
> my TODO list (I will get back to your questions then, too).
> Regards
> Piotr
> 
