I started with your approach initially, i.e. building my indexing on Lucene only... but eventually dropped it completely. Now I'm working entirely out of Nutch with a custom indexing plug-in, and I'm quite happy with it. The only downside I found is that the Nutch search bean does not offer as much functionality as I needed, so I had to build my own plug-in there too. Later I decided to build a scoring plug-in as well to focus the crawl: works great.
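In case it helps: the custom indexing plug-in is just an implementation of Nutch's IndexingFilter extension point. This is not my actual code, only a bare-bones sketch (the exact method set and the NutchDocument type differ between Nutch versions, and the class and field names here are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    public class MyIndexingFilter implements IndexingFilter {
      private Configuration conf;

      // Called for every page just before it is written to the index:
      // add or rewrite fields here, or return null to drop the page entirely.
      public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                                  CrawlDatum datum, Inlinks inlinks)
          throws IndexingException {
        doc.add("site", url.toString());         // example extra field
        doc.add("rawcontent", parse.getText());  // keep the parsed text around
        return doc;
      }

      // Some Nutch versions also declare this on the interface; harmless otherwise.
      public void addIndexBackendOptions(Configuration conf) { }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }

On top of the class you need a plugin.xml that registers it under the org.apache.nutch.indexer.IndexingFilter extension point, and the plug-in id has to be listed in plugin.includes in nutch-site.xml, otherwise Nutch never loads it.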
I don't really see the need for Hadoop either right now, but I like the idea that it will be there if I need it, because my crawls might become quite big/long. Not sure if I will move to Solr indexing in the future; I'm avoiding it at the moment to minimize complexity.

-Raymond-

2009/6/6 KK <dioxide.softw...@gmail.com>

> Hi All,
> I've been using Solr and Lucene for some time. I started with Solr, then
> moved to Lucene because of the greater flexibility/openness in Lucene, but
> I like both. For my requirements I want to crawl webpages and add them to a
> Lucene index. So far I've been doing the crawling manually and adding the
> pages to the Lucene index through the Lucene APIs. The webpages have
> content that is a mix of, say, 5% English and the rest non-English [Indian]
> content. To handle stemming/stop-word removal for the English part, I wrote
> a small custom analyzer for use in Lucene, and that's working fairly well.
> Now I was thinking of doing the crawling part using Nutch. Does this sound
> OK? I went through the Nutch wiki page and found that it supports a bunch
> of file types [like html/xml, pdf, odf, ppt, ms word etc.], but for me html
> is good enough. The wiki also says that it builds distributed indexes using
> Hadoop [I've used Hadoop a bit], which uses the map-reduce architecture.
> But my requirements don't call for that much. Distributed indexing is not
> required, so essentially I don't need the Hadoop/map-reduce stuff. So let
> me summarize what I want:
> #. Crawl the webpage; I want Nutch to hand me over the content, I don't
> want it to post that content to Lucene directly by itself. Essentially I
> want to step in between crawling and indexing, as I have to apply my custom
> analyzer before the contents are indexed by Lucene.
> #. For me html parsing is good enough [no need for pdf/odf/msword etc.]
> #. No need for Hadoop/map-reduce.
>
> I'd like the users of Nutch to let me know their views. The other option is
> to look for Java open-source crawlers that can do the job. I don't find
> any, and I'm more interested in using something really good/well tested
> like Nutch. Let me know your opinions.
>
> Thanks,
> KK.
>
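For what it's worth, the kind of English-only analyzer KK describes is only a short class on the Lucene side. The sketch below is not KK's analyzer; it just chains the stock English filters after a StandardTokenizer (assuming a Lucene 2.x-style Analyzer API, whose constructors have changed in later releases), which in practice mostly leaves non-Latin tokens alone:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class MixedContentAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(reader); // Unicode-aware tokenizing
        stream = new StandardFilter(stream);
        stream = new LowerCaseFilter(stream);
        // English stop words only; non-English tokens simply never match the set
        stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
        // English Porter stemming; its suffix rules rarely touch non-Latin tokens
        stream = new PorterStemFilter(stream);
        return stream;
      }
    }

You hand an instance of this to your IndexWriter when you do the indexing yourself, which is exactly the "step in between crawling and indexing" part: read the parsed text back out of the Nutch segments and feed it to your own IndexWriter instead of letting the Nutch indexer post it.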