> > Did you check Apache solr?

Exactly. Solr is a very powerful open source indexer. It's a subproject of Apache Lucene and uses the Lucene libraries for indexing, and it is well supported by the community. You could use the Tika content extraction framework to index not only HTML but also many other rich document formats such as doc, ppt, xls, rtf and pdf, and even tar.gz, bzip and zip archives.
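
To give a rough idea of the Tika route: Solr exposes Tika through its extracting request handler (Solr Cell), normally mapped at /update/extract. Below is a minimal Python sketch, not a tested setup. The base URL http://localhost:8983/solr, the helper names build_extract_url and post_file, and the file name are my assumptions; newer Solr versions also need the core name in the base URL, and the handler has to be enabled in solrconfig.xml.

```python
import mimetypes
import urllib.request
from urllib.parse import urlencode


def build_extract_url(solr_base, doc_id, commit=True):
    """Build the URL for Solr's extracting handler (/update/extract).

    solr_base is an assumed base URL such as http://localhost:8983/solr.
    literal.id sets the unique id field of the document being indexed.
    """
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def post_file(solr_base, doc_id, path):
    """Send one file's raw bytes to Solr, which hands it to Tika for parsing.

    Hypothetical helper: error handling and authentication are left out.
    """
    content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        build_extract_url(solr_base, doc_id),
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Example call (needs a running Solr instance, so it is commented out):
# post_file("http://localhost:8983/solr", "doc1", "report.pdf")
```

For 2000 sites you would not post files one by one like this; a crawler would feed Solr instead. The sketch only shows where Tika sits in the indexing path.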
Initcron Labs has built a Solr appliance named Blaze. Check it out at http://www.initcron.org/blaze .

There is also another Lucene-based project called Nutch, which provides web-specific features such as a crawler, an HTML parser, a link graph database, etc. You can also integrate Solr and Nutch to build a solution.

Here are a few useful links:

Solr: http://lucene.apache.org/solr/
Tika + Solr: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika
Nutch: http://nutch.apache.org/about.html
Solr + Nutch: http://wiki.apache.org/nutch/RunningNutchAndSolr
Lucene: http://lucene.apache.org/java/docs/index.html

If you are looking for assistance or consulting to implement a Solr-based solution, contact me off the list.

Thanks
Gourav
www.initcron.org

> > Dear luggies
> >
> > I am planning to have a search engine similar to google for my intranet
> > (actually it spans entire India, with about 2000 intranet sites). I expect
> > about 500-600 GB of data and about 1 million pages. I found
> > htdig (htdig.org) and mnogosearch (mnogosearch.org) to be suitable.

_______________________________________________
ILUGC Mailing List: http://www.ae.iitm.ac.in/mailman/listinfo/ilugc
