This sure is quite interesting.

But I bet it is not going to be easy.

I have heard about Lucene. Since it is Java ....

But I hear it is available for Python as well. Even Perl.

-Girish

On Fri, Oct 22, 2010 at 11:57 PM, Gourav Shah <[email protected]> wrote:
>>
>> Did you check Apache Solr?
>>
>>
> Exactly. Solr is a very powerful open source indexer. It's a
> subproject of Apache Lucene and uses the Lucene libraries for indexing. It is
> well supported by the community. You could use the Tika content extraction
> framework to index not only HTML but also many other rich document formats
> such as doc, ppt, xls, rtf, pdf, and even tar.gz, bzip, and zip archives.
>
> Initcron Labs has designed an appliance for Solr named Blaze. Check it
> out at http://www.initcron.org/blaze .
>
> There is also another Lucene-based project called Nutch, which provides
> web-specific features such as a crawler, HTML parser, link graph database,
> etc. You can also integrate Solr and Nutch to build a solution.
>
> Here are a few useful links
> Solr: http://lucene.apache.org/solr/
> Tika + Solr :
> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika
> Nutch: http://nutch.apache.org/about.html
> Solr + Nutch: http://wiki.apache.org/nutch/RunningNutchAndSolr
> Lucene: http://lucene.apache.org/java/docs/index.html
>
>
> If you are looking for assistance/consulting to implement a Solr-based
> solution, contact me off the list.
>
> Thanks
> Gourav
> www.initcron.org
>
>
>
>> > Dear luggies
>> >
>> > I am planning to have a search engine similar to Google for my intranet
>> > (actually it spans entire India, with about 2000 intranet sites). I expect
>> > about 500-600 GB of data and about 1 million pages. I found
>> > htdig (htdig.org) and mnogosearch (mnogosearch.org) to be suitable.
>> >
>>
> _______________________________________________
> ILUGC Mailing List:
> http://www.ae.iitm.ac.in/mailman/listinfo/ilugc
>



-- 
Gayatri Hitech

http://gayatri-hitech.com
[email protected]
