Hi all, I'm trying to figure out ways to improve the efficiency of focused crawling with Nutch.
I'm looking for specific pages within each domain that contain the content I need. I can't know whether a given URL has that content until I fetch it, parse it, and run some analysis on it.

I was thinking about two methods to improve crawling efficiency:

1) Whenever a page is found that contains the data I'm looking for, raise the score of all pages linking to it (and the pages linking to them, and so on), on the assumption that they have other links pointing to relevant content.

2) Once I have found several pages that contain relevant data, automatically create a regex to match new URLs that might contain usable content.

I've started reading about the OPIC-score plugin, but I couldn't tell whether it can help me with item 1.

Any ideas? I would be very grateful for any help or pointers in the right direction.

Thanks,
Eran
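
P.S. To make the two ideas concrete, here is a rough sketch in plain Python. It is not Nutch plugin code and does not use any Nutch API. The function names, the `inlinks` dictionary (page URL mapped to the URLs that link to it), and the `boost`, `decay`, `max_depth` and `min_support` parameters are all made up for illustration. For idea 1 it walks the in-links backwards from a relevant page and gives each ancestor a boost that shrinks with distance. For idea 2 it only generalizes runs of digits in the URL path and ignores query strings, which is probably too naive for real sites.

```python
import re
from collections import Counter, deque
from urllib.parse import urlparse


def propagate_relevance(inlinks, relevant_url, boost=1.0, decay=0.5, max_depth=3):
    """Idea 1 (sketch): boost pages that lead to a relevant page.

    `inlinks` is a hypothetical dict: url -> iterable of urls linking to it.
    Returns {ancestor_url: boost}, where the boost is multiplied by `decay`
    for every hop away from the relevant page.
    """
    boosts = {}
    seen = {relevant_url}
    queue = deque([(relevant_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not propagate further than max_depth hops
        for parent in inlinks.get(url, ()):
            if parent in seen:
                continue  # breadth-first, so the first visit is the shortest path
            seen.add(parent)
            boosts[parent] = boost * decay ** (depth + 1)
            queue.append((parent, depth + 1))
    return boosts


def url_to_pattern(url):
    """Idea 2 (sketch): turn one URL into a regex by generalizing digit runs."""
    parsed = urlparse(url)
    # With a capturing group, re.split puts the digit runs at odd indices.
    parts = re.split(r"(\d+)", parsed.path)
    path = "".join(
        r"\d+" if i % 2 == 1 else re.escape(part)
        for i, part in enumerate(parts)
    )
    return "^https?://" + re.escape(parsed.netloc) + path + "$"


def learn_patterns(relevant_urls, min_support=2):
    """Keep only patterns shared by at least `min_support` relevant URLs."""
    counts = Counter(url_to_pattern(u) for u in relevant_urls)
    return [pattern for pattern, n in counts.items() if n >= min_support]
```

For example, with the chain a -> b -> c -> d and d relevant, `propagate_relevance({"d": ["c"], "c": ["b"], "b": ["a"]}, "d", max_depth=2)` boosts c by 0.5 and b by 0.25 and leaves a untouched. What I don't know is where in Nutch such a boost would be applied, or whether the OPIC scoring already does something similar.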
