I started with your approach initially, i.e. building my indexing on Lucene
only... but eventually dropped it completely.
Now I'm working only out of Nutch with a custom indexing plug-in, and I'm
quite happy with it. The only downside I found is that the Nutch search bean
does not offer as much functionality as I needed, so I had to build my own
plug-in there too. Later I also built a scoring plug-in to focus the crawl:
it works great.
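
For anyone curious, the indexing plug-in side boils down to an
IndexingFilter that gets a chance to add or modify fields on each document
just before it is written to the index. A minimal sketch, assuming the
Nutch 1.0 plugin API (exact method signatures vary slightly between Nutch
versions; the class name and the "section" field are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class SiteSectionIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Example: derive a "section" field from the URL so the search side
    // can filter on it later.
    String section = url.toString().contains("/docs/") ? "docs" : "other";
    doc.add("section", section);
    return doc; // returning null here would drop the page from the index
  }

  // Some Nutch versions use this hook to declare how fields are stored
  // and indexed on the Lucene side; a no-op is fine for this sketch.
  public void addIndexBackendOptions(Configuration conf) {
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

The plug-in also needs the usual plugin.xml descriptor and has to be listed
in the plugin.includes property in nutch-site.xml before Nutch will load it.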

I don't really see the need for Hadoop right now either, but I like the idea
that it will be there if I need it, since my crawls might become quite
big/long.

Not sure if I will move to Solr indexing in the future; I'm avoiding it for
now to minimize complexity.

-Raymond-



2009/6/6 KK <dioxide.softw...@gmail.com>

> Hi All,
> I've been using Solr and Lucene for some time. I started with Solr, then
> moved to Lucene because of its greater flexibility/openness, but I like
> both. For my requirement I want to crawl webpages and add them to a
> Lucene index. So far I've been doing the crawling manually and adding the
> pages to the Lucene index through the Lucene APIs. The webpages contain a
> mix of roughly 5% English and the rest non-English [Indian] content. To
> handle stemming/stop-word removal for the English part, I wrote a small
> custom analyzer for use in Lucene, and that's working fairly well. Now I'm
> thinking of doing the crawling part with Nutch. Does this sound OK? I went
> through the Nutch wiki page and found that it supports a bunch of file
> types [like html/xml, pdf, odf, ppt, ms word etc.], but for me HTML is
> good enough. The wiki also says that it builds distributed indexes using
> Hadoop [I've used Hadoop a bit], which uses the map-reduce architecture.
> But for my requirement I don't need all of that. Distributed indexing is
> not required, so essentially I don't need the Hadoop/map-reduce stuff. Let
> me summarize what I want:
> #. Crawl the webpages. I want Nutch to hand the content over to me rather
> than post it directly to Lucene by itself. Essentially I want to step in
> between crawling and indexing, since I have to run my custom analyzer [see
> the sketch after this list] before the content is indexed by Lucene.
> #. For me, HTML parsing is good enough [no need for pdf/odf/msword etc.].
> #. No need for Hadoop/map-reduce.
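>
> By "custom analyzer" I mean something roughly along these lines (a rough
> sketch against the Lucene 2.x API; the class name MixedContentAnalyzer is
> just a placeholder). It lower-cases, removes English stop words and
> applies the Porter stemmer; the non-English tokens are essentially left
> alone by those filters:
>
> import java.io.Reader;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.LowerCaseFilter;
> import org.apache.lucene.analysis.PorterStemFilter;
> import org.apache.lucene.analysis.StopAnalyzer;
> import org.apache.lucene.analysis.StopFilter;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.standard.StandardTokenizer;
>
> public class MixedContentAnalyzer extends Analyzer {
>   public TokenStream tokenStream(String fieldName, Reader reader) {
>     // Tokenize everything, then apply the English-specific filters;
>     // tokens the filters don't recognize are passed through as they are.
>     TokenStream stream = new StandardTokenizer(reader);
>     stream = new LowerCaseFilter(stream);
>     stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
>     stream = new PorterStemFilter(stream);
>     return stream;
>   }
> }
>
> I pass this analyzer to the IndexWriter when adding the crawled pages.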
>
> I'd like Nutch users to let me know their views. The other option is to
> look for open-source Java crawlers that can do the job, but I haven't
> found any, and I'm more interested in using something really good and
> well tested like Nutch. Let me know your opinions.
>
> Thanks,
> KK.
>
