Re: Nutch - new public server

Piotr Kosiorowski Fri, 08 Apr 2005 11:16:34 -0700

Hello Stefan, We use SVM-light. Your results look really good. We will have to spend some time on our classification process soon.

I hope that after the initial release I will find more time to contribute some of the changes we made and maybe help with new features.
Thanks,
Piotr

Stefan Groschupf wrote:

Hi Piotr, it sounds like you regenerate your feature vectors for every classification run, right? Does that make sense? Is your training material different every time? We found SVM too slow and use a custom algorithm when indexing pages (Apple dual G5, Nutch pages per second: 177.87161, precision = 0.980217038, recall = 0.86788284, f1 = 0.920635899, training and test material from DMOZ). This includes feature vector generation for each document.
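
As a quick sanity check on the figures quoted above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two numbers. The sketch below is only an illustration: the function name `f1_score` is made up here, and nothing in it reflects how the numbers were actually produced.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1 measure)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both values are zero
    return 2 * precision * recall / (precision + recall)


# Figures copied from the message above.
precision = 0.980217038
recall = 0.86788284
reported_f1 = 0.920635899

computed_f1 = f1_score(precision, recall)
print(f"computed f1 = {computed_f1:.9f}, reported f1 = {reported_f1}")
```

The recomputed value agrees with the reported one to well within rounding, so the three quoted figures are mutually consistent.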

May I ask which SVM implementation you use: the C-based one from T. Joachims, a Java implementation, or maybe a custom one?

However, I agree in general. I would love to see custom metadata in the WebDB; maybe one day I can find people who are interested in implementing this, and we can contribute it to Nutch.
Thanks for the information.
Stefan
On 08.04.2005 at 17:56, Piotr Kosiorowski wrote:
Hi Stefan, I was quite surprised by the SVM performance. We wanted to achieve high precision, and we were quickly able to achieve over 90%, but recall was about 60%. These numbers were OK for us to begin with, and we plan to tune it further in the future. In fact, we are not categorizing pages during fetching: we have a separate step that categorizes the whole segment. So for a segment of 1 million pages it takes ~45 min to generate feature vectors from the segment data, classify them, and generate an output file with URLs and classification scores. The majority of the time is spent on feature vector generation and file operations; classification alone is quite fast.
Regards
Piotr
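
For comparison with the pages-per-second figure quoted elsewhere in this thread, the batch timing above can be converted to a rate. This is a rough sketch that treats "~45 min" as exactly 45 minutes and "1mln" as exactly 1,000,000 pages; the helper name `pages_per_second` is made up for illustration. The two setups ran on different hardware with different pipelines, so the rates are not directly comparable.

```python
def pages_per_second(pages: int, minutes: float) -> float:
    """Convert a batch size and a wall-clock duration in minutes to pages per second."""
    return pages / (minutes * 60)


# Assumed exact values for the approximate figures in the message above.
segment_pages = 1_000_000
segment_minutes = 45

batch_rate = pages_per_second(segment_pages, segment_minutes)
print(f"about {batch_rate:.1f} pages per second for the whole batch step")
```

Under these assumptions the batch step works out to roughly 370 pages per second, including feature vector generation and file operations.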
Stefan Groschupf wrote:
Piotr,
very interesting. Can you tell us how SVM performs? I would be interested to hear how many pages per second you can handle on your server and what the quality is (recall, precision, F1).
Thanks a lot!
Stefan
On 06.04.2005 at 22:20, Piotr Kosiorowski wrote:
Hello all,
I would like to thank all Nutch developers and users for the high-quality code and support that helped us deploy a beta version of a travel-related web search engine on the www.igougo.com site. The decision to base our solution on Nutch was a perfect one: the quality of the Nutch code allowed us to build a prototype quickly and integrate our code easily.

The search engine runs on Opteron boxes with Linux and JVM 1.5 (64-bit version). The web search engine is based on Nutch code with some modifications. The current solution uses the latest patch for using the host name and title in ranking. We use a customized set of boosts for fields (I will send a separate email about it, as I promised some time ago). We have our own implementation of WebDB, based on MySQL. It was good for our purposes, as it allowed us to easily integrate classification of pages and the additional information we needed, but as we want to grow our index size we will have performance problems, so we will have to change it in the near future (we are interested in the MapReduce implementation here). Classification of pages during fetching was done using Support Vector Machines.

It is released as a beta to allow users to interact with it, but there is still a lot of work to do, especially in the areas of relevancy and spam removal. Changes will be introduced gradually in the following months.

I will add a link to our search engine to the Nutch Wiki as soon as it is fully transferred to Apache, to avoid problems in the middle of the transition.

Once again, thank you all for a high-quality search engine; I am looking forward to using Nutch in the future.
Regards
Piotr Kosiorowski
Senior Software Developer
Travel Search Technologies
Sabre Holdings
PS. For those interested, the Sabre Holdings press release is here:
    http://www.forbes.com/home/feeds/ap/2005/04/05/ap1925698.html
---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net
