Hi All,

I am in need of an (open-source) web crawler (a la wget), but one that does all of the following:

1. Performs breadth-first search, not depth-first search (so a stopping condition based on disk space gives a wide crawl rather than a deep one).
2. Lets me define whether or not to recurse into a link, based on criteria (leaving the domain or not being the most obvious, but also regexping the URL, etc.).
3. Optimally, lets me provide a lambda function that returns a rating based on page content, so I can decide whether to recurse and what to avoid.

A rough sketch of the kind of interface I mean follows below.
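Something like this, say (a minimal standard-library-only sketch; the should_follow and rate_page names and signatures are just placeholders of mine, not any existing tool's API):

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, should_follow, rate_page, max_pages=100):
    """Breadth-first crawl starting from `seed`.

    should_follow(url) -> bool    decides whether a link is enqueued at all
    rate_page(url, html) -> float score; pages rated <= 0 are not expanded
    """
    queue = deque([seed])          # FIFO queue gives breadth-first order
    seen = {seed}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            with urlopen(url) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue               # unreachable page: skip, keep crawling
        fetched += 1
        if rate_page(url, html) <= 0:   # content-based cutoff
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            child = urljoin(url, href)
            if urlparse(child).scheme not in ("http", "https"):
                continue           # drop mailto:, javascript:, etc.
            if child not in seen and should_follow(child):
                seen.add(child)
                queue.append(child)


# Example: stay on one domain, only expand pages mentioning "python".
crawl(
    "http://example.com/",
    should_follow=lambda u: urlparse(u).netloc == "example.com",
    rate_page=lambda u, html: 1.0 if "python" in html.lower() else 0.0,
)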
Anyone? I will write such a thing if none is found, but I'd really prefer not to.

Shachar Tal
Verint Systems
