Hi Alex, I will answer inline so we can follow comments...
On Thu, 2010-08-19 at 19:21 +0100, Alex McLintock wrote:

> Hello Gonzalo,
>
> Did you mean to post to the dev list?

Yes! Users normally don't know what to implement if features are missing...

> Further comments inline
>
> On 19 August 2010 18:25, Gonzalo Aguilar Delgado
> <[email protected]> wrote:
> > Hi there!
> >
> > I'm building a crawler that will "understand" certain kinds of pages. I want to
> > be able to process a restricted group of websites.
>
> Nutch has the capability to configure a URL filter which can limit the
> hosts to a specific set of regular expressions.

That's normal. This won't be much of a problem.

> > In essence, for example: I want to search for reviews of my company's
> > products on some blogs I know well.
>
> That sounds like a standard data mining requirement.

Exactly!

> > I don't know if Nutch can help me here.
>
> Well, it can, but not out of the box. It depends on what sort of
> automation you want.
> Nutch can crawl all those sites and build up a Solr/Lucene index for
> you to search through, but I am guessing that won't help you very much.

No. What I need is to extract some fields from the pages... And then maybe Solr/Lucene can help... But not with the full text, since blogs, for example, tend to include a lot of garbage...

> > What I'm currently doing is a crawler that fetches pages, transforms them
> > with the template designed for the site with XSLT
>
> Eh? You are using XSLT to transform random web pages? Doesn't the XSLT
> fall over whenever it finds non-well-formed XML?

What I really do is normalize the input with TagSoup and then process each page with the custom template. This way everything works... The problem I have is that MySpace, for example, is too big for my crawler. I have never actually tried to parse it, but it would surely take a lot of time... So I will need to scale in the future.

> > and then parses the content.
>
> Parses it for what? What do you do with it?

What I want to do is a kind of buzz engine.
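For reference, the URL filter Alex mentions is configured in Nutch's conf/regex-urlfilter.txt. A minimal sketch restricting the crawl to a couple of known review blogs might look like this (the hostnames below are placeholders, not from the thread):

```
# conf/regex-urlfilter.txt -- rules are checked top to bottom; first match wins.
# The hostnames here are hypothetical stand-ins for the blogs you care about.

# skip URLs with common non-HTML suffixes
-\.(gif|jpg|png|css|js|zip|gz)$

# accept only the known review blogs
+^http://blog\.example\.com/
+^http://reviews\.example\.org/

# reject everything else
-.
```

The final `-.` line is what turns an open crawl into a closed one: any URL not matched by an earlier `+` rule is dropped.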
It will tell me how much buzz a product is gaining on the web. It must parse blogs, ranking pages, osCommerce pages, RSS feeds, etc...

> > The question here is: Can this be done well with Nutch, or will it imply a
> > big overhead?
>
> I don't think this is *easy* with Nutch. The overhead may be worth it
> if you want to do the web crawling on a small cluster rather than one
> machine.

Maybe in the future, but not now... So I think it is better to build my own custom one...

> There may be other better data mining tools, but I'm not sure I can
> recommend anything right now.

This is very specific, so I'm not sure anything will help me.

> > What plugins will need to be developed?
>
> Well, that depends on what you want. Presumably you want something that
> identifies the web page as a review of your product so that it can be
> highlighted in the index. How do you want to do that?

Pufff! I'm lost on this... Can I write you a personal mail to explain what I want to do and how this will work?

> > Thank you!
>
> I've been thinking about this for some time - but to search for book
> reviews instead of product reviews. I can't say that I have a working
> system, but maybe others do.

I can already parse some sites... I'm trying to do it better, multi-site and social. Let me contact you so I can explain it better...

Thanks Alex!

> Alex
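Gonzalo's pipeline (TagSoup to normalize messy HTML, then a per-site XSLT template to pull fields) is Java-based. As a self-contained sketch of the same idea in Python, the stdlib html.parser also tolerates malformed markup; the "review" class rule below is a hypothetical per-site extraction rule standing in for the XSLT template, and the sample page is invented:

```python
from html.parser import HTMLParser

class ReviewExtractor(HTMLParser):
    """Lenient parser: pulls the page title and any text inside
    elements whose class is 'review' (a hypothetical per-site rule,
    playing the role of Gonzalo's XSLT template)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.review_depth = 0   # >0 while inside a class="review" subtree
        self.title = ""
        self.review_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        classes = dict(attrs).get("class", "").split()
        if "review" in classes or self.review_depth > 0:
            self.review_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        if self.review_depth > 0:
            self.review_depth -= 1

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif self.review_depth > 0 and data.strip():
            self.review_text.append(data.strip())

# Deliberately sloppy HTML: unquoted attribute, unclosed <p> tags.
page = """<html><head><title>Gadget X review</title></head>
<body><div class=review><p>Great product<p>Would buy again</div></body>"""

ex = ReviewExtractor()
ex.feed(page)
print(ex.title)        # -> Gadget X review
print(ex.review_text)  # -> ['Great product', 'Would buy again']
```

In the real pipeline this per-site rule would be one template among many, keyed by hostname, which is exactly where a flat crawl-and-index tool stops helping and custom code starts.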

