Hi all, I am relatively new to Nutch and I am trying to understand how it crawls websites, specifically how it creates and prioritises its fetch list. I have a couple of questions I would like to ask:
1. What are Nutch's crawl URL sources? I believe they are the WebDB and the segments, but I am not sure.
2. How does Nutch prioritise crawling? By content expiration date only?
3. Is there some way to affect the order in which Nutch fetches URLs? I have been reading the Generator class but have not found an extension point for this.

Thanks in advance,
Rodrigo