This sure is quite interesting. But I bet it is not going to be easy.
I have heard about Lucene. Since it is Java .... But I hear it is there in
Python as well. Even Perl.

-Girish

On Fri, Oct 22, 2010 at 11:57 PM, Gourav Shah <[email protected]> wrote:

>> Did you check Apache solr?
>
> Exactly. Solr is a very powerful open source indexer. It's a
> subproject of Apache Lucene and uses the Lucene libraries for indexing. It is
> well supported by the community. You could use the Tika content extraction
> framework to index not only HTML but also a lot of other rich text documents
> such as doc, ppt, xls, rtf, pdf, and even tar.gz, bzip, and zip formats.
>
> Initcron Labs has designed an appliance for Solr named Blaze. Check it
> out at http://www.initcron.org/blaze .
>
> There is also another Lucene-based project called Nutch, which provides
> web-specific features such as a crawler, an HTML parser, a link graph
> database, etc. You can also integrate Solr and Nutch to build a solution.
>
> Here are a few useful links:
> Solr: http://lucene.apache.org/solr/
> Tika + Solr:
> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika
> Nutch: http://nutch.apache.org/about.html
> Solr + Nutch: http://wiki.apache.org/nutch/RunningNutchAndSolr
> Lucene: http://lucene.apache.org/java/docs/index.html
>
> If you are looking for assistance/consulting to implement a Solr-based
> solution, contact me off the list.
>
> Thanks
> Gourav
> www.initcron.org
>
>> > Dear luggies
>> >
>> > I am planning to have a search engine similar to Google for my intranet
>> > (actually it spans entire India, with about 2000 intranet sites). I
>> > expect about 500-600 GB of data and about 1 million pages. I found
>> > htdig (htdig.org) and mnogosearch (mnogosearch.org) to be suitable.
--
Gayatri Hitech
http://gayatri-hitech.com
[email protected]
_______________________________________________
ILUGC Mailing List:
http://www.ae.iitm.ac.in/mailman/listinfo/ilugc
