I found "Mining the Web: Discovering Knowledge from Hypertext Data" by Soumen Chakrabarti a useful reference.
http://www.amazon.com/gp/product/1558607544/103-9548474-1631829?v=glance&n=283155

Rgrds,
Thomas

On 8/29/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Mladen Adamovic wrote:
> > Hi!
> >
> > I want to get more insight into various search engine algorithms. I
> > have wide knowledge of standard data structures & algorithms
> > (hash values, trees, graphs, etc.). I thought that Lucene would be
> > a good place to start looking for information, and indeed I've found
> > some decent information at the Nutch website. However, I decided to
> > post some personal opinions on this issue here, thinking that someone
> > might give me even more information.
> >
> > As far as I understand, I should read books about Information
> > Retrieval (e.g. Modern Information Retrieval by Baeza-Yates,
> > Ribeiro-Neto). Any update?
> >
> > Starting from one article about link spam and searching CiteSeer,
> > I also found a range of articles about link spam techniques, namely:
> > 1. Undue Influence: Eliminating the Impact of Link Plagiarism on Web
> > Search Rankings
> > 2. Using Rank Propagation and Probabilistic Counting for Link-Based
> > Spam Detection
> > 3. SpamRank: Fully Automatic Link Spam Detection
> > 4. Identifying Link Farm Spam Pages
> > 5. Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam
>
> Yes, good references. At this moment most of my working knowledge about
> search engines comes either from the book you cited above, or from
> papers found on Citeseer - play around with IR related terms, you will
> find a LOT of papers to read... ;). And then follow references from
> those papers ...
>
> I also found that other printed books are either too outdated or not so
> relevant to web-scale IR.
>
> In the end (as usual) the best way to really dig into the subject is
> to try and solve a real-life problem, combining the tools you already
> have and what you have learned.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
