I found "Mining the Web: Discovering Knowledge from Hypertext Data"
by Soumen Chakrabarti a useful reference.

http://www.amazon.com/gp/product/1558607544/103-9548474-1631829?v=glance&n=283155

Rgrds, Thomas

On 8/29/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Mladen Adamovic wrote:
> > Hi!
> >
> > I want to get more insight into various search engine algorithms. I
> > have broad knowledge of standard data structures & algorithms
> > (hash values, trees, graphs, etc.). I thought that Lucene would be
> > a good place to start looking for information, and indeed I've found
> > some decent information on the Nutch website. However, I decided to
> > post some personal opinions on this issue here, hoping that someone
> > might give me even more information.
> >
> > As far as I understand, I should read books about Information
> > Retrieval (e.g. Modern Information Retrieval by Baeza-Yates and
> > Ribeiro-Neto). Any update?
> >
> > Starting from one article about link spam and searching Citeseer, I
> > also found several articles about link spam techniques, namely:
> > 1. Undue Influence: Eliminating the Impact of Link Plagiarism on Web
> > Search Rankings
> > 2. Using Rank Propagation and Probabilistic Counting for Link-Based
> > Spam Detection
> > 3. SpamRank - Fully Automatic Link Spam Detection
> > 4. Identifying Link Farm Spam Pages
> > 5. Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam
>
> Yes, good references. At this moment most of my working knowledge about
> search engines comes either from the book you cited above, or from
> papers found on Citeseer - play around with IR-related terms, you will
> find a LOT of papers to read... ;). And then follow references from
> those papers ...
>
> I also found that other printed books are either too outdated or not so
> relevant to web-scale IR.
>
> In the end (as usual) the best way to really dig into the subject is
> to try and solve a real-life problem, combining the tools you already
> have and what you have learned.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers