Hello,

 I'm trying to understand how to start with an initial set of URLs and continue 
fetching new URLs and re-fetching existing URLs (when they are due for a re-fetch).

I have run some tests in order to understand the software's behavior. 
Now I have some questions and would appreciate your help.

 
   1. I have set db.default.fetch.interval to 1 (in nutch-default.xml), 
   but I have noticed that the fetchInterval field in the Page object is being set to 
   current time + 7 days while the URL link data is being read from the fetchlist. 
   Can somebody explain why, or am I not reading the code correctly? 
   2. I have modified the code to ignore the fetchInterval value coming from the 
   fetchlist, so that fetchInterval stays equal to the initial value: 
   current time. After I run the following commands: fetch, db update, 
   and generate db segments, 
   I get a new fetchlist, but this list doesn't include my 
   original sites, even though their next fetch time should already be in the past. Can 
   somebody help me understand when those URLs will be fetched? 
   3. It looks like the fetcher fails to extract links from http://www.eltweb.com. 
   I know that there are some formats (including some HTML variations, it seems) 
   that are not supported. Where can I find information on what is currently 
   supported? 
   4. Some of the out-links discovered during the fetch (for instance: 
   http://www.webct.com/software/viewpage?name=software_campus_edition or 
   http://v.extreme-dm.com/?login=cguilfor ) are being ignored (not 
   included in the next fetchlist after executing the [generate db segments] 
   command). Is there a known reason for this? Is there documentation 
   describing the supported URL types? 
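
For reference, here is the property from question 1 as I have it set. If I understand the docs correctly, the value is in days; I changed it directly in nutch-default.xml, though I gather nutch-site.xml is the usual place for local overrides:

```xml
<!-- Re-fetch interval, in days. I set it to 1 so that pages
     should become due for re-fetch after one day. -->
<property>
  <name>db.default.fetch.interval</name>
  <value>1</value>
</property>
```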

  I'm still new to this software. I have tried to explain what I did and hope 
it was clear enough, but I'm not sure I have asked the right questions. 

 Thanks,

Daniel
