Hi Alex, 

I'll answer inline so we can follow the comments...


On Thu, 2010-08-19 at 19:21 +0100, Alex McLintock wrote: 
> Hello Gonzalo,
> 
> Did you mean to post to the dev list?

Yes! Users normally don't know what to implement when features are
missing...


> Further comments inline
> 
> On 19 August 2010 18:25, Gonzalo Aguilar Delgado
> <[email protected]> wrote:
> > Hi there!
> >
> > I'm building a crawler that will "understand" some kinds of pages. I
> > want to be able to process a restricted group of websites.
> 
> Nutch has the capability to configure a URL filter which can limit the
> hosts to a specific set of regular expressions.
> 
That's standard. It won't be much of a problem.
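Just to make the whitelist idea concrete, here is a minimal Python sketch of that kind of host filter (the hostnames and patterns are invented; in Nutch itself this is configured in conf/regex-urlfilter.txt rather than in code):

```python
import re
from urllib.parse import urlparse

# Hypothetical whitelist of sites we want the crawler to stay inside,
# analogous to the +/- regex lines in Nutch's conf/regex-urlfilter.txt.
ALLOWED_HOST_PATTERNS = [
    re.compile(r"(^|\.)example-blog\.com$"),
    re.compile(r"(^|\.)reviews\.example\.org$"),
]

def accept_url(url: str) -> bool:
    """Return True only when the URL's host matches the whitelist."""
    host = urlparse(url).hostname or ""
    return any(p.search(host) for p in ALLOWED_HOST_PATTERNS)
```

Subdomains are allowed by the `(^|\.)` prefix, so `www.example-blog.com` passes but `notexample-blog.com` does not.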


> > In essence, for example: I want to search for reviews of my
> > company's products on some blogs I know well.
> 
> That sounds like a standard data mining requirement.

Exactly!


> > I don't know if Nutch can help me here.
> 
> Well, it can, but not out of the box. - It depends on what sort of
> automation you want.
> Nutch can crawl all those sites and build up a Solr/Lucene index for
> you to search through, but I am guessing that won't help you very much.

Nope. What I need is to extract some fields from pages... and then maybe
Solr/Lucene can help. But not with the whole text, since blogs, for
example, tend to include a lot of garbage...


> > What I'm currently doing is a crawler that fetches pages, transforms
> > them with the XSLT template designed for each site,
> 
> Eh? You are using XSLT to transform random web pages? Doesn't the XSLT
> fall over whenever it finds non-well-formed XML?

What I really do is normalize the input with TagSoup, then process each
page with the site's custom template. This way everything works...
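A rough Python sketch of that same normalize-then-extract idea, using the stdlib's tolerant html.parser in place of TagSoup, and a simple per-site rule in place of the XSLT template (the class name "review-title" is hypothetical):

```python
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Tolerant, TagSoup-like extraction: collect the text of elements
    whose 'class' attribute matches a per-site rule.  This stands in
    for the XSLT template step; it is a sketch, not a full templating
    engine (void elements like <br> would need extra handling)."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0          # >0 while inside a wanted element
        self.fields = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self.wanted_class in classes:
            self.depth += 1
            self.fields.append("")   # start collecting a new field
        elif self.depth:
            self.depth += 1          # nested tag inside a wanted element

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.fields[-1] += data

# Even malformed markup (the <b> is never closed) is tolerated:
html = '<div class="review-title">Great <b>product</div></div>'
ex = FieldExtractor("review-title")
ex.feed(html)
```

After `feed()`, `ex.fields` holds the extracted text of every matching element.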

The problem I have is that some sites, for example MySpace, are too big
for my crawler. I've never actually tried to parse it, but it would
surely take a lot of time... so I will need to scale in the future.



> > and then parses the content.
> 
> Parses it for what? What do you do with it?
> 
What I want to do is a kind of buzz engine. It will tell me how much
buzz a product is gaining on the web. It must parse blogs, ranking
pages, osCommerce pages, RSS feeds, etc...
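To make the idea concrete, here is a crude sketch of such a buzz metric over one RSS feed (the feed content and product name are invented, and a real engine would also weight by source, date, sentiment, etc.):

```python
import xml.etree.ElementTree as ET

def buzz_score(rss_xml: str, product: str) -> int:
    """Crude buzz metric: count the RSS items whose title or
    description mentions the product, case-insensitively."""
    root = ET.fromstring(rss_xml)
    score = 0
    for item in root.iter("item"):
        text = " ".join(
            (item.findtext(tag) or "") for tag in ("title", "description")
        )
        if product.lower() in text.lower():
            score += 1
    return score

# Invented sample feed with one matching item:
rss = """<rss><channel>
<item><title>WidgetX review</title><description>Loved it</description></item>
<item><title>Other news</title><description>nothing here</description></item>
</channel></rss>"""
```

Summing `buzz_score` across many feeds over time would give a first trend line per product.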



> > The question here is: can this be done well with Nutch, or will it
> > imply a big overhead?
> 
> I don't think this is *easy* with Nutch. The overhead may be worth it
> if you want to do the web crawling on a small cluster rather than one
> machine.
> 
Maybe in the future, but not now... So I think it's better to build my
own custom one...


> There may be other better data mining tools, but I'm not sure I can
> recommend anything right now.
> 
This is very specific, so I'm not sure an existing tool will help me.


> > What plugins would need to be developed?
> 
> Well that depends on what you want. Presumably you want something that
> identifies the web page as a review of your product so that it can be
> highlighted in the index. How do you want to do that?
> 
Pufff! I'm lost on this... Can I write you a personal mail to explain
what I want to do and how this will work?
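On Alex's question of how to identify a page as a product review, one naive way to start, purely as an illustration: a keyword heuristic that flags likely reviews (the keyword list and threshold are guesses; a trained classifier would do better):

```python
# Hypothetical review-style keywords; a real system would learn these.
REVIEW_HINTS = ("review", "rating", "stars", "pros", "cons", "verdict")

def looks_like_review(text: str, product: str) -> bool:
    """Flag a page when it mentions the product and contains at least
    two review-style keywords (threshold chosen arbitrarily)."""
    t = text.lower()
    hits = sum(1 for kw in REVIEW_HINTS if kw in t)
    return product.lower() in t and hits >= 2
```

Pages flagged this way could then be highlighted in the index, as Alex suggests.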


> 
> > Thank you!
> 
> I've been thinking about this for some time - but to search for book
> reviews instead of product reviews. I can't say that I have a working
> system, but maybe others do.
> 
I can already parse some sites... I'm trying to do it better: multi-site
and social.

Let me contact you so I can explain it better...

Tnx Alex!


> Alex
> 
