Hello, I'm trying to understand how to start with an initial set of URLs and then keep fetching newly discovered URLs and re-fetching existing URLs when they are due for a re-fetch.
I have run some tests to understand the software's behavior, and now I have a few questions and am seeking your help.

1. I set db.default.fetch.interval to 1 (in nutch-default.xml), but I noticed that the fetchInterval field in the Page object is being set to the current time + 7 days while the URL link data is being read from the fetchlist. Can somebody explain why, or am I misreading the code?

2. I modified the code to ignore the fetchInterval value coming from the fetchlist, so that fetchInterval stays equal to its initial value (the current time). After I run the following commands: fetch, db update, and generate db segments, I get a new fetchlist, but this list does not include my original sites, even though their next fetch time should already be in the past. Can somebody help me understand when those URLs will be fetched?

3. It looks like the fetcher fails to extract links from http://www.eltweb.com. I know that some formats (apparently including some HTML variations) are not supported. Where can I find information on what is currently supported?

4. Some of the outlinks discovered during the fetch (for instance http://www.webct.com/software/viewpage?name=software_campus_edition or http://v.extreme-dm.com/?login=cguilfor) are being ignored, i.e. not included in the next fetchlist after running the generate db segments command. Is there a known reason for this? Is there any documentation describing which URL types are supported?

I'm still new to this software; I have tried to explain what I did and hope it was clear enough, but I'm not sure I have asked the right questions.

Thanks,
Daniel
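P.S. A note on question 1, in case it helps: this is roughly how I set the fetch interval. I understand that local overrides are normally supposed to go into conf/nutch-site.xml rather than editing nutch-default.xml directly, so please treat this as a sketch of my intent rather than the exact change I made (the root element name below is copied from the shipped config files in my version and may differ in yours).

```xml
<!-- Sketch of an override in conf/nutch-site.xml (my understanding is
     that edits belong here, not in nutch-default.xml itself). -->
<nutch-conf>
  <property>
    <name>db.default.fetch.interval</name>
    <!-- Value is in days; the shipped default is 30, so 1 should
         mean "re-fetch daily" if I am reading the config right. -->
    <value>1</value>
  </property>
</nutch-conf>
```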
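A note on question 4: after writing the above I noticed that both example URLs contain a '?', so my current guess is that they are being dropped by the default URL filter (conf/crawl-urlfilter.txt or conf/regex-urlfilter.txt, depending on the version), which ships with a rule excluding typical CGI/dynamic URLs, something like:

```
# From the default URL filter file (as I understand it): skip URLs
# containing these characters, which usually indicate CGI/dynamic
# pages -- this would exclude both example URLs above.
-[?*!@=]

# Accept anything else.
+.
```

Can someone confirm whether this filter is what removes them from the generated fetchlist?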
