Very interesting.
Can you tell us how the SVM performs?
I would be interested to hear how many pages per second you can handle on your server and what the classification quality is (recall, precision, F1).
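To be clear about which numbers I mean, here is a minimal sketch in Python of the usual definitions, computed from raw counts on a labelled test set. The function name and the counts are made up for illustration only; they are not measurements from any system.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1) from raw classification counts."""
    predicted = true_positives + false_positives   # pages the classifier accepted
    actual = true_positives + false_negatives      # pages that really belong to the class
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f1(true_positives=80, false_positives=20, false_negatives=20)
print(p, r, f)
```

Numbers in this form, together with the size of the test set, would make it easy to compare with other classifiers.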
Thanks a lot! Stefan
On 06.04.2005 at 22:20, Piotr Kosiorowski wrote:
Hello all,
I would like to thank all nutch developers and users for the high-quality code and support that helped us deploy a beta version of a travel-related web search engine on the www.igougo.com site. The decision to base our solution on nutch was a perfect one: the quality of the nutch code allowed us to build a prototype quickly and integrate our code easily.
The search engine runs on Opteron boxes with Linux and JVM 1.5 (64-bit version). It is based on nutch code with some modifications. The current solution uses the latest patch for using host name and title in ranking. We use a customized set of boosts for fields (I will send a separate email about it, as I promised some time ago).
We have our own implementation of WebDB, based on MySQL. It was good for our purposes, as it allowed us to easily integrate classification of pages and the additional information we needed, but as we want to grow our index size we will run into performance problems, so we will have to change it in the near future (we are interested in the map reduce implementation here).
Classification of pages during fetching was done using Support Vector Machines.
It is released as a beta to allow users to interact with it, but there is still a lot of work to do, especially in the areas of relevancy and spam removal. Changes will be introduced gradually in the following months.
I will add a link to our search engine to the nutch Wiki as soon as it is fully transferred to Apache, to avoid problems in the middle of the transition.
Once again, thank you all for a high-quality search engine; I am looking forward to using nutch in the future,
Regards
Piotr Kosiorowski
Senior Software Developer
Travel Search Technologies
Sabre Holdings
PS. For those interested, the Sabre Holdings press release is here: http://www.forbes.com/home/feeds/ap/2005/04/05/ap1925698.html
---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
