I've always been curious to see how traditional IR algorithms, based on TF-IDF, can be applied to search on the Web, which has a totally different topology than a flat document base. Because of the particular topology of the web, algorithms such as Google's PageRank, based on link popularity, tend to return the most representative document WITHIN a site whose anchors or content contain the keyword searched. Normally, when a company name is searched, this pinpoints the most referenced URL, typically the company home page, though it may not be the one that contains the most occurrences of the company's name (e.g. a search for "Toyota" yields "Toyota.com" at the top in Google). This also avoids getting too many hits from the same site just because the word is very common within the site. The problem becomes very obvious when you search for "Toyota" at mozdex.com. Apart from the company's site ranking lower than expected (you have to go to the end of page 1 and then to page 2 to get documents from the company's main US website), there are many, many hits from the Toyota.com site (arguably one for each type of car they have ;-). This is because of the obviously high Term Frequency (i.e. "Toyota" occurs everywhere within Toyota.com).
 
Is it possible to create a ranking algorithm that could treat a site as a WHOLE, while still pinpointing the most relevant document within it based on the query terms? Has anyone considered things such as SITE-BASED TF and IDF? Maybe a good way to pinpoint the best document within the site would be to look at its internal topology (which the crawler knows), without having to compute an expensive overall PageRank calculation?
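
To make the question concrete, here is a rough sketch of what I mean, not anything that exists in Nutch. All names (site_of, rank_by_site, the pages and links dictionaries) are hypothetical. It assumes pages are already tokenized to lowercase, that a "site" is simply the URL's host name, and that in-site in-link count is a cheap stand-in for the internal topology. The TF and IDF weighting used here is one common variant, chosen only for illustration.

```python
import math
from collections import Counter, defaultdict
from urllib.parse import urlparse


def site_of(url):
    # Assumption: a "site" is identified by the host part of the URL.
    return urlparse(url).netloc.lower()


def rank_by_site(pages, links, query):
    """pages: url -> list of lowercase tokens; links: url -> list of target urls.

    Returns one (site, site_score, best_url) tuple per matching site,
    ordered by descending site-level TF-IDF score.
    """
    terms = query.lower().split()

    # Treat each site as a single "document": pool term counts per site.
    site_tf = defaultdict(Counter)
    site_pages = defaultdict(list)
    for url, tokens in pages.items():
        site_tf[site_of(url)].update(tokens)
        site_pages[site_of(url)].append(url)

    # Internal topology: count only links that stay within the same site.
    indegree = Counter()
    for src, targets in links.items():
        for dst in targets:
            if src != dst and site_of(src) == site_of(dst):
                indegree[dst] += 1

    n_sites = len(site_tf)
    results = []
    for site, tf in site_tf.items():
        score = 0.0
        for t in terms:
            if tf[t] == 0:
                continue
            # Site-based document frequency: number of sites containing t.
            df = sum(1 for counts in site_tf.values() if counts[t] > 0)
            score += (1 + math.log(tf[t])) * math.log(1 + n_sites / df)
        if score == 0.0:
            continue
        # Among the site's pages that mention a query term, pick the one
        # with the most in-site in-links (typically the home page).
        candidates = [u for u in site_pages[site]
                      if any(t in pages[u] for t in terms)]
        best = max(candidates, key=lambda u: indegree[u])
        results.append((site, score, best))

    results.sort(key=lambda r: r[1], reverse=True)
    return results
```

With something like this, a query for "Toyota" would return a single entry for the Toyota.com host, represented by its most internally linked page, rather than one hit per car model. The in-degree count only needs the link data for one site at a time, so it avoids a global link analysis.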
 
Just my two cents regarding relevance testing of NUTCH.
 
__________________________

Joaquin Delgado, PhD.
Chief Technology Officer
TripleHop Technologies, Inc.
Office: (212) 243-4645, ext. 405
Cell: (646) 342-4880
45 West 25th Street, 9th floor (6th Ave.)
New York, NY 10010
www.TripleHop.com

 

 
