It sounds like the right choice, though you will have to write your own page-parsing plugin that knows which parts of the page to keep and which ones to throw away. The final output is stored in HDFS in a non-XML format, but there are tools that allow easy, sequential reading of those files, so you can post-process them if you want to convert them to XML, for example.
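If it helps, here is a rough sketch of what that post-processing could look like in Java. The extracted text for each segment sits in Hadoop MapFiles whose data files can be read as plain SequenceFiles. Treat the path layout (parse_text/part-00000/data), the minimal XML escaping and the class name below as assumptions to adapt to your own crawl layout and Nutch/Hadoop versions, not as a finished tool:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.ParseText;

public class SegmentToXml {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Data file of one segment's parse_text MapFile, readable as a SequenceFile,
    // e.g. crawl/segments/<segment>/parse_text/part-00000/data
    Path data = new Path(args[0]);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();             // key: page URL
    ParseText text = new ParseText();  // value: extracted plain text
    System.out.println("<pages>");
    while (reader.next(url, text)) {
      System.out.println("  <page url=\"" + escape(url.toString()) + "\">");
      System.out.println("    " + escape(text.getText()));
      System.out.println("  </page>");
    }
    System.out.println("</pages>");
    reader.close();
  }

  // Very small XML escaper, just enough for this sketch.
  private static String escape(String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;")
            .replace(">", "&gt;").replace("\"", "&quot;");
  }
}

You would run something like this once per segment directory that the crawl produced and merge the output however suits your training pipeline.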
Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: KishoreKumar Bairi <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Friday, May 30, 2008 4:22:09 AM
> Subject: Does nutch serve my purpose?
>
> Hi,
>
> I'm new to Nutch. I don't even know if it serves my purpose.
>
> I'm working on a machine learning problem for which I need a corpus that can be
> obtained by crawling the web (the dataset I require is not available ready-made),
> and my requirements are as follows:
>
> The crawler should crawl only links of a certain pattern (www.domain.com/id),
> and it should fetch only specific data from each crawled page (instead of the
> entire page content): say, one element from a page of pattern1 and the element
> with id="reqd2" from a page of pattern2, then merge both. That would be one
> example of the data I require. Likewise, I need a few hundred/thousand
> pages (examples).
>
> Finally, all the fetched text should be stored in some kind of
> database/XML files, so that I can use it for training my program.
>
> Can anyone please tell me: is Nutch the right choice for me? If not, what
> would be the best way to accomplish my task?
>
> Regards,
> KishoreKumar.
