>
> Did you check Apache solr?
>
>
Exactly. Solr is a very powerful open source indexer. It's a subproject of
Apache Lucene and uses the Lucene libraries for indexing. It is well
supported by the community. You could use the Tika content extraction
framework to index not only HTML but also a lot of other rich text documents
such as doc, ppt, xls, rtf and pdf, and even tar.gz, bzip and zip archives.
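
To give an idea of what indexing a rich document through Tika looks like,
here is a minimal sketch that posts a file to Solr's extracting request
handler (Solr Cell), which runs Tika on the server side. It assumes the
handler is enabled at /update/extract and that Solr is reachable at a base
URL such as http://localhost:8983/solr; the function names, the base URL and
the document id are only illustrative, so adjust them to your setup.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, doc_id, commit=True):
    """Build the URL for Solr's extracting request handler.

    solr_base is an assumption, e.g. "http://localhost:8983/solr".
    literal.id sets the unique id of the indexed document.
    """
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def index_file(solr_base, doc_id, path,
               content_type="application/octet-stream"):
    """POST a file (pdf, doc, ...) to Solr; Tika extracts the text there."""
    with open(path, "rb") as fh:
        body = fh.read()
    # Supplying data makes urllib send a POST request.
    req = Request(build_extract_url(solr_base, doc_id), data=body,
                  headers={"Content-Type": content_type})
    with urlopen(req) as resp:
        return resp.status
```

For example, index_file("http://localhost:8983/solr", "doc1", "report.pdf")
would send report.pdf for extraction and indexing, provided the handler is
configured in solrconfig.xml.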

Initcron Labs has designed an appliance for Solr named Blaze. Check it out
at http://www.initcron.org/blaze .

There is also another Lucene-based project called Nutch, which provides
web-specific features such as a crawler, an HTML parser, a link graph
database, etc. You can also integrate Solr and Nutch to build a solution.
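
Once the crawled pages are in Solr, searching them is just an HTTP request to
Solr's select handler. Below is a rough sketch of building such a query URL.
The base URL and the "content" field name are assumptions (check the schema
your Nutch/Solr setup actually uses), and the helper name is made up for
illustration. Note that it does not escape Solr query syntax characters in
the user's text.

```python
from urllib.parse import urlencode


def build_search_url(solr_base, text, rows=10, field="content"):
    """Build a URL for Solr's /select handler returning JSON results.

    solr_base (e.g. "http://localhost:8983/solr") and the default
    field name "content" are assumptions, not fixed by Solr itself.
    """
    params = {
        "q": f"{field}:({text})",  # search the given field for the terms
        "rows": rows,              # number of results to return
        "wt": "json",              # response format
    }
    return solr_base.rstrip("/") + "/select?" + urlencode(params)
```

Fetching the resulting URL with any HTTP client (a browser, curl, or
urllib.request.urlopen) returns the matching documents.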

Here are a few useful links:
Solr: http://lucene.apache.org/solr/
Tika + Solr :
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika
Nutch: http://nutch.apache.org/about.html
Solr + Nutch: http://wiki.apache.org/nutch/RunningNutchAndSolr
Lucene: http://lucene.apache.org/java/docs/index.html


If you are looking for assistance/consulting to implement a Solr-based
solution, contact me off the list.

Thanks
Gourav
www.initcron.org



> > Dear luggies
> >
> > I am planning to have a search engine similar to google for my intranet
> > (actually it spans entire India, with about 2000 intranet sites). I expect
> > about 500-600gb data and about 1 million pages. I found
> > htdig(htdig.org) and mnogosearch(mnogosearch.org) to be suitable.
> >
>
_______________________________________________
ILUGC Mailing List:
http://www.ae.iitm.ac.in/mailman/listinfo/ilugc
