Hello Gonzalo,

Did you mean to post to the dev list?
Further comments inline.

On 19 August 2010 18:25, Gonzalo Aguilar Delgado <[email protected]> wrote:

> Hi there!
>
> I'm building a crawler that will "understand" some kind of pages. I want to
> be able to process a restricted group of websites.

Nutch has the capability to configure a URL filter, which can limit the crawl to hosts matching a specific set of regular expressions.

> In essence, for example: I want to search for reviews of the products of my
> company in some blogs I know well.

That sounds like a standard data mining requirement.

> I don't know if Nutch can help me here.

Well, it can, but not out of the box. It depends on what sort of automation you want. Nutch can crawl all those sites and build up a Solr/Lucene index for you to search through, but I am guessing that won't help you very much.

> What I'm currently doing is a crawler that fetches pages, transforms them
> with the template designed for the site with XSLT

Eh? You are using XSLT to transform arbitrary web pages? Doesn't the XSLT processor fall over whenever it finds non-well-formed XML?

> and then parses the content.

Parses it for what? What do you do with it?

> The question here is: can this be done well with Nutch, or will it imply a
> big overhead?

I don't think this is *easy* with Nutch. The overhead may be worth it if you want to do the web crawling on a small cluster rather than a single machine. There may be other, better data mining tools, but I'm not sure I can recommend anything right now.

> What plugins will need to be developed?

Well, that depends on what you want. Presumably you want something that identifies a web page as a review of your product, so that it can be highlighted in the index. How do you want to do that?

> Thank you!

I've been thinking about this for some time, but to search for book reviews instead of product reviews. I can't say that I have a working system, but maybe others do.

Alex
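P.S. To make the URL filter point concrete: Nutch reads regex rules from conf/regex-urlfilter.txt, trying them top to bottom and taking the first match; `+` accepts, `-` rejects. A minimal sketch for restricting a crawl to a known set of blogs (the host names here are made up for illustration) might look like:

```
# conf/regex-urlfilter.txt -- rules are tried in order; first match wins
# accept only the (hypothetical) blogs we care about
+^http://blog\.example-one\.com/
+^http://reviews\.example-two\.net/
# reject everything else
-.
```

The trailing `-.` rule is what keeps the crawler from wandering off to the rest of the web.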
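P.P.S. On the XSLT point: here is a small stdlib-only Python sketch (not Nutch code, and the page snippet is invented) showing why strict XML tooling chokes on real web pages, while a lenient HTML parser copes with the same tag soup:

```python
# Illustration: strict XML parsing vs. lenient HTML parsing of tag soup.
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A typical real-world page fragment: unclosed <p>, <br>, <b> tags.
broken_page = "<html><body><p>Great product<br><b>5 stars</body>"

try:
    ET.fromstring(broken_page)  # strict XML parse, as XSLT input would need
except ET.ParseError as err:
    print("XML parser rejected it:", err)

class TextCollector(HTMLParser):
    """Accumulate text content, tolerating the broken markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

collector = TextCollector()
collector.feed(broken_page)
print(" ".join(collector.chunks))  # -> Great product 5 stars
```

So if you want to keep the XSLT approach, you would normally tidy the page into well-formed XHTML first; otherwise a tolerant parser like this is the safer starting point.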

