Hi,
I'm new to Nutch, and I don't even know whether it serves my purpose. I'm working on a machine learning problem for which I need a corpus that can be obtained by crawling the web (the required dataset is not available). My requirements are as follows:

- The crawler should follow only links of a certain pattern (www.domain.com/id).
- It should fetch only specific data from each crawled page, instead of the entire page content: say, <div id="reqd1"></div> from pages of pattern1 and <div id="reqd2"></div> from pages of pattern2, then merge both. That merged text would be one example of the data I require. Likewise, I need a few hundred or a few thousand such pages (examples).
- Finally, all the fetched text should be stored in some kind of database or in XML files, so that I can use it for training my program.

Can anyone please tell me whether Nutch is the right choice for this? If not, what would be the best way to accomplish my task?

Regards,
KishoreKumar.
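For illustration, the extract-merge-store step described above (pull one div from each of two pages, concatenate, and write the result into an XML corpus) can be sketched in plain Python with only the standard library. This is a minimal sketch, not a Nutch answer: the two `page1`/`page2` strings stand in for fetched pages of the two URL patterns (the actual crawling/fetching is omitted), and the element names `corpus`/`example` are made up for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class DivExtractor(HTMLParser):
    """Collects the text inside the first <div> whose id matches target_id."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # nesting depth while inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1     # nested div inside the target
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1          # entered the target div

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

    def text(self):
        return "".join(self.chunks).strip()

def extract_div(html, div_id):
    parser = DivExtractor(div_id)
    parser.feed(html)
    return parser.text()

# Hypothetical fetched pages standing in for the two URL patterns.
page1 = '<html><body><div id="reqd1">first part</div></body></html>'
page2 = '<html><body><div id="reqd2">second part</div></body></html>'

# Merge the two extracted fragments into one training example.
merged = extract_div(page1, "reqd1") + " " + extract_div(page2, "reqd2")

# Store each merged example as an <example> element in an XML corpus.
root = ET.Element("corpus")
ET.SubElement(root, "example").text = merged
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

In a real pipeline you would loop this over the hundreds or thousands of URL pairs and append one `<example>` element per pair before writing the tree to disk.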
