hi... i might be a little out of my league, but here goes...
i'm in need of an app to crawl through sections of sites and return pieces of information. i'm not looking to do any indexing, just to return raw HTML/text. however, i need the ability to set certain criteria that define which pages actually get returned. a given crawling process would normally start at some URL and iteratively fetch the files underneath it. nutch does this and provides some additional functionality, but i need more. in particular, i'd like to be able to modify the way nutch handles forms, links/queries on a given page, and data extraction. i'd like to be able to:

for forms:
- allow the app to handle POST/GET forms
- allow the app to select (implement/ignore) given elements within a form
- track the form(s) for a given URL/page/level of the crawl

for links:
- allow the app to include/exclude a given link on a given page/URL via regex matching or a list of URLs
- allow the app to handle querystring data, i.e. include/exclude the URL+query based on regex matching or simple text comparison

for data extraction:
- ability to do XPath/regex extraction based on the DOM
- permit multiple XPath/regex expressions to be run on a given page

this kind of functionality would allow the nutch crawl to be relatively selective about how it walks through a site and extracts the required information.

any thoughts/comments/ideas regarding this process? if i shouldn't use nutch, are there any suggestions as to what app i should use?

thanks
-bruce
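p.s. to make the link-filtering idea concrete, here's a minimal standalone sketch of the include/exclude logic i have in mind. this is not nutch's actual filter API (nutch does its own thing via urlfilter plugins); the URL and patterns are made-up examples. a URL passes only if it matches at least one include pattern and no exclude pattern, and because the exclude patterns see the full URL+query, a querystring can reject a page whose path would otherwise match:

```java
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {

    // A URL is accepted when some include pattern matches it
    // and no exclude pattern matches it (URL and query together).
    public static boolean accept(String url,
                                 List<Pattern> include,
                                 List<Pattern> exclude) {
        boolean included = include.stream().anyMatch(p -> p.matcher(url).find());
        boolean excluded = exclude.stream().anyMatch(p -> p.matcher(url).find());
        return included && !excluded;
    }

    public static void main(String[] args) {
        // Hypothetical site and rules, just for illustration.
        List<Pattern> inc = List.of(Pattern.compile("^https?://example\\.com/docs/"));
        List<Pattern> exc = List.of(Pattern.compile("[?&]sessionid="));

        check(accept("https://example.com/docs/page1.html", inc, exc));
        check(!accept("https://example.com/other/page.html", inc, exc));
        // path matches, but the querystring knocks it out:
        check(!accept("https://example.com/docs/page1.html?sessionid=42", inc, exc));
        System.out.println("all checks passed");
    }

    private static void check(boolean cond) {
        if (!cond) throw new AssertionError("filter check failed");
    }
}
```

the same `accept` shape would also cover the "list of URLs" case by compiling each listed URL with `Pattern.quote`.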
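p.p.s. and a sketch of the "multiple XPath expressions per page" part, using only the JDK's built-in DOM/XPath classes. the page markup here is a made-up, well-formed toy; real-world HTML would first need tidying into XML (e.g. with an HTML parser) before a DOM pass like this:

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathExtractSketch {
    public static void main(String[] args) throws Exception {
        // Toy, well-formed page standing in for fetched HTML.
        String page = "<html><body>"
                + "<h1>Title</h1>"
                + "<a href='/a.html'>one</a>"
                + "<a href='/b.html'>two</a>"
                + "</body></html>";

        // Parse once into a DOM...
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(page)));

        // ...then run several expressions against the same DOM.
        List<String> exprs = List.of("//h1/text()", "//a/@href");
        XPathFactory xpf = XPathFactory.newInstance();
        for (String expr : exprs) {
            NodeList hits = (NodeList) xpf.newXPath()
                    .evaluate(expr, dom, XPathConstants.NODESET);
            for (int i = 0; i < hits.getLength(); i++) {
                System.out.println(expr + " -> " + hits.item(i).getNodeValue());
            }
        }
    }
}
```

running this prints the heading text and both href values, one extracted value per line; regex extraction would just be another pass over the same fetched page text.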
