It sounds like the right choice, though you will have to write your own page-parsing plugin that knows which parts of the page to keep and which ones to throw away. The final output is stored in HDFS in a non-XML format, but there are tools that allow easy, sequential reading of those files, so you can post-process them if you want to convert them to XML, for example.
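If it helps, here is a rough sketch of what that post-processing could look like in Java. The extracted text for each segment sits in Hadoop MapFiles whose data files can be read as plain SequenceFiles. Treat the path layout (parse_text/part-00000/data), the minimal XML escaping and the class name below as assumptions to adapt to your own crawl layout and Nutch/Hadoop versions, not as a finished tool:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.ParseText;

public class SegmentToXml {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Data file of one segment's parse_text MapFile, readable as a SequenceFile,
    // e.g. crawl/segments/<segment>/parse_text/part-00000/data
    Path data = new Path(args[0]);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();             // key: page URL
    ParseText text = new ParseText();  // value: extracted plain text
    System.out.println("<pages>");
    while (reader.next(url, text)) {
      System.out.println("  <page url=\"" + escape(url.toString()) + "\">");
      System.out.println("    " + escape(text.getText()));
      System.out.println("  </page>");
    }
    System.out.println("</pages>");
    reader.close();
  }

  // Very small XML escaper, just enough for this sketch.
  private static String escape(String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;")
            .replace(">", "&gt;").replace("\"", "&quot;");
  }
}

You would run something like this once per segment directory that the crawl produced and merge the output however suits your training pipeline.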
Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: KishoreKumar Bairi <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Friday, May 30, 2008 4:22:09 AM
> Subject: Does nutch serve my purpose?
>
> Hi,
>
> I'm new to Nutch. I don't even know if it serves my purpose.
>
> I'm working on a machine learning problem for which I need a corpus that can be
> obtained by crawling the web (the dataset I require is not available ready-made),
> and my requirements are as follows:
>
> The crawler should crawl only links of a certain pattern (www.domain.com/id),
> and it should fetch only specific data from each crawled page (instead of the
> entire page content): say, one element from a page of pattern1 and the element
> with id="reqd2" from a page of pattern2, then merge both. That would be one
> example of the data I require. Likewise, I need a few hundred/thousand
> pages (examples).
>
> Finally, all the fetched text should be stored in some kind of
> database/XML files, so that I can use it for training my program.
>
> Can anyone please tell me: is Nutch the right choice for me? If not, what
> would be the best way to accomplish my task?
>
> Regards,
> KishoreKumar.
