Hello Gonzalo,

Did you mean to post to the dev list?
Further comments inline.

On 19 August 2010 18:25, Gonzalo Aguilar Delgado <[email protected]> wrote:

> Hi there!
>
> I'm building a crawler that will "understand" some kind of pages. I want to
> be able to process a restricted group of websites.

Nutch has the capability to configure a URL filter, which can limit the crawl to hosts matching a specific set of regular expressions.

> In essence, for example: I want to search for reviews of the products of my
> company in some blogs I know well.

That sounds like a standard data mining requirement.

> I don't know if Nutch can help me here.

Well, it can, but not out of the box. It depends on what sort of automation you want. Nutch can crawl all those sites and build up a Solr/Lucene index for you to search through, but I am guessing that won't help you very much.

> What I'm currently doing is a crawler that fetches pages, transforms them
> with the template designed for the site with XSLT

Eh? You are using XSLT to transform arbitrary web pages? Doesn't the XSLT processor fall over whenever it finds non-well-formed XML?

> and then parses the content.

Parses it for what? What do you do with it?

> The question here is: can this be done well with Nutch, or will it imply a
> big overhead?

I don't think this is *easy* with Nutch. The overhead may be worth it if you want to do the web crawling on a small cluster rather than a single machine. There may be other, better data mining tools, but I'm not sure I can recommend anything right now.

> What plugins will need to be developed?

Well, that depends on what you want. Presumably you want something that identifies a web page as a review of your product, so that it can be highlighted in the index. How do you want to do that?

> Thank you!

I've been thinking about this for some time, but to search for book reviews instead of product reviews. I can't say that I have a working system, but maybe others do.

Alex
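P.S. To make the URL filter point concrete: Nutch reads regex rules from conf/regex-urlfilter.txt, trying them top to bottom and taking the first match; `+` accepts, `-` rejects. A minimal sketch for restricting a crawl to a known set of blogs (the host names here are made up for illustration) might look like:

```
# conf/regex-urlfilter.txt -- rules are tried in order; first match wins
# accept only the (hypothetical) blogs we care about
+^http://blog\.example-one\.com/
+^http://reviews\.example-two\.net/
# reject everything else
-.
```

The trailing `-.` rule is what keeps the crawler from wandering off to the rest of the web.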
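P.P.S. On the XSLT point: here is a small stdlib-only Python sketch (not Nutch code, and the page snippet is invented) showing why strict XML tooling chokes on real web pages, while a lenient HTML parser copes with the same tag soup:

```python
# Illustration: strict XML parsing vs. lenient HTML parsing of tag soup.
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A typical real-world page fragment: unclosed <p>, <br>, <b> tags.
broken_page = "<html><body><p>Great product<br><b>5 stars</body>"

try:
    ET.fromstring(broken_page)  # strict XML parse, as XSLT input would need
except ET.ParseError as err:
    print("XML parser rejected it:", err)

class TextCollector(HTMLParser):
    """Accumulate text content, tolerating the broken markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

collector = TextCollector()
collector.feed(broken_page)
print(" ".join(collector.chunks))  # -> Great product 5 stars
```

So if you want to keep the XSLT approach, you would normally tidy the page into well-formed XHTML first; otherwise a tolerant parser like this is the safer starting point.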

