I've always been
curious to see how traditional IR algorithms based on TF-IDF can be
applied to search on the Web, which has a totally different topology from a
flat document base. Because of the particular topology of the Web, algorithms
such as Google's PageRank, based on link popularity, tend to return the most
representative document WITHIN a site whose anchors or content contain the
keyword searched. Normally, when a company name is searched, this pinpoints
the most referenced URL, typically the company home page, though it may not be
the one that contains the most occurrences of the company's name (e.g. a search
for "Toyota" yields "Toyota.com" at the top in Google).
This also
avoids getting too many hits from the same site just because the word is very
common within the site. This problem becomes very obvious when you search for
"Toyota" at mozdex.com. Apart from the company's main US website ranking lower
than expected (you have to go to the end of page 1 and then to page 2 to find
its documents), there are many, many hits from
the Toyota.com site (arguably one for each type of car they have ;-). This
is because of the obviously high term frequency (i.e. "Toyota" occurs
everywhere within Toyota.com).
Is it possible to
create a ranking algorithm that treats a site as a WHOLE while
still pinpointing the most relevant document within it based on the query
terms? Has anyone considered things such as SITE-BASED TF
and IDF? Maybe a good way to pinpoint the best document within the
site is to look at its internal topology (which the crawler knows), without
having to compute an expensive overall PageRank calculation?
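To make the question concrete, here is a rough sketch of what I mean by
site-based TF and IDF. It is only an illustration under my own assumptions, not
anything Nutch does today: a "site" is taken to be a host name, TF is computed
over all pages of a site pooled together, IDF counts sites rather than pages,
and the best page within a site is simply the matching page with the most
internal in-links. The function names and the example.* URLs are hypothetical.

```python
import math
from collections import Counter, defaultdict
from urllib.parse import urlparse


def site_of(url):
    """Group pages by host name (an assumed definition of 'site')."""
    return urlparse(url).netloc.lower()


def rank_sites(pages, links, query):
    """Return one (site, best_url, score) hit per site, best first.

    pages: dict mapping url -> list of lowercase tokens
    links: iterable of (source_url, target_url) pairs seen by the crawler
    query: list of lowercase query terms
    """
    # Pool every page of a site into one bag of words.
    site_terms = defaultdict(Counter)
    for url, tokens in pages.items():
        site_terms[site_of(url)].update(tokens)

    # Site-level document frequency: in how many sites does a term occur?
    n_sites = len(site_terms)
    site_df = Counter()
    for counts in site_terms.values():
        site_df.update(counts.keys())

    # Count only links that stay inside one site (the internal topology).
    internal_inlinks = Counter()
    for src, dst in links:
        if src != dst and dst in pages and site_of(src) == site_of(dst):
            internal_inlinks[dst] += 1

    hits = []
    for site, counts in site_terms.items():
        total = sum(counts.values())
        score = 0.0
        for term in query:
            if counts[term] == 0:
                continue
            tf = counts[term] / total                     # site-based TF
            idf = math.log(1 + n_sites / site_df[term])   # site-based IDF
            score += tf * idf
        if score == 0.0:
            continue
        # Among the site's pages that match the query, pick the one with the
        # most internal in-links (ties go to the shorter URL).
        candidates = [
            url for url, tokens in pages.items()
            if site_of(url) == site and any(t in tokens for t in query)
        ]
        best = max(candidates, key=lambda u: (internal_inlinks[u], -len(u)))
        hits.append((site, best, score))
    return sorted(hits, key=lambda h: h[2], reverse=True)


# Hypothetical toy crawl: three pages on one site, one page on another.
pages = {
    "http://toyota.example/": ["toyota", "cars", "home"],
    "http://toyota.example/camry": ["toyota", "camry", "toyota", "sedan"],
    "http://toyota.example/corolla": ["toyota", "corolla", "toyota"],
    "http://news.example/autos": ["toyota", "recall", "news", "today",
                                  "market", "report"],
}
links = [
    ("http://toyota.example/camry", "http://toyota.example/"),
    ("http://toyota.example/corolla", "http://toyota.example/"),
    ("http://toyota.example/", "http://toyota.example/camry"),
    ("http://news.example/autos", "http://toyota.example/"),
]
hits = rank_sites(pages, links, ["toyota"])
for site, best_url, score in hits:
    print(site, best_url, round(score, 3))
```

In this toy crawl each site contributes a single hit, and the home page is
chosen for the first site even though the model pages repeat the term more
often, because the choice within a site uses internal links rather than raw
term counts. Only links inside a site are counted, so no global link analysis
is needed.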
Just my two cents
regarding relevance testing of NUTCH.
__________________________
Joaquin Delgado, PhD.
Chief Technology Officer
TripleHop Technologies, Inc.
Office: (212) 243-4645, ext. 405
Cell: (646) 342-4880
45 West 25th Street, 9th floor (6th Ave.)
New York, NY 10010
www.TripleHop.com
