Re: [Nutch-dev] Sites vs. Documents

Stefan Groschupf Fri, 28 May 2004 12:43:24 -0700

An interesting idea, but I don't think it would work well in real cases.
What about websites like blogger.com or GeoCities? Each is a set of individual "sites" that mostly have no relation to each other.
From my point of view, just grouping results does not bring the benefit Joaquin is looking for.

A very pragmatic solution would be to index the URL as well and use it in relevance ranking, maybe with a configurable weight.
So Toyota.com and www.cars.com/toyota would have more relevance than a page that contains the word Toyota 100 times.
Ontology is another keyword that comes to mind when thinking about this problem; I personally guess Google is using something like that.
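A minimal sketch of the URL-weighting idea, not Nutch code: the function names, the `url_weight` parameter and its default of 5.0 are all hypothetical. It also assumes the body term frequency is log-dampened, which is my assumption and not part of the suggestion above; without some dampening, a page repeating the word 100 times would still win unless the weight were very large.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def score(query, url, body, url_weight=5.0):
    """Dampened body term frequency plus a weighted bonus for URL matches.

    url_weight is a hypothetical, configurable knob (arbitrary default).
    """
    terms = tokenize(query)
    body_counts = Counter(tokenize(body))
    url_tokens = set(tokenize(url))
    # Assumption: log-dampen raw counts so repetition has diminishing returns.
    body_score = sum(math.log1p(body_counts[t]) for t in terms)
    # One point per query term that appears as a token of the URL.
    url_score = sum(1.0 for t in terms if t in url_tokens)
    return body_score + url_weight * url_score


# A page that repeats the word 100 times vs. pages with the word in the URL.
spam = score("Toyota", "http://example.org/page", "toyota " * 100)
home = score("Toyota", "http://www.toyota.com/", "Toyota")
dealer = score("Toyota", "http://www.cars.com/toyota", "Toyota")
print(home > spam, dealer > spam)
```

With `url_weight=0.0` the URL is ignored and the repetitive page ranks first again, so the weight is what decides how much the URL counts.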

I'm not sure whether that will work, but we can just try it. I haven't had time to try it yet, since I'm limited by another full-time job, limited bandwidth and hardware. (That is a hidden signal ;) )

The other way I see, though it is much more difficult, is to use clustering to group a logical "site" together.
Content clustering or "design" pattern clustering could be useful for such a task.
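A rough sketch of what such clustering might look like, under strong assumptions: each page is reduced to a set of features (words, or "design" markers such as template or CSS names), and pages are merged into one logical site when their Jaccard similarity reaches a threshold. The function names, the feature strings and the 0.5 threshold are all hypothetical, and single-link merging is just one simple choice among many clustering methods.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_pages(features, threshold=0.5):
    """Group URLs into logical "sites" by single-link clustering.

    features: dict mapping URL -> set of feature strings (hypothetical input).
    Returns a list of sets of URLs.
    """
    parent = {u: u for u in features}

    def find(u):
        # Union-find lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for a, b in combinations(features, 2):
        if jaccard(features[a], features[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for u in features:
        groups.setdefault(find(u), set()).add(u)
    return list(groups.values())


# Two unrelated blogs hosted on the same (made-up) domain.
features = {
    "http://host.example/alice/1": {"css:blue", "nav:alice", "footer:a"},
    "http://host.example/alice/2": {"css:blue", "nav:alice", "footer:a", "sidebar"},
    "http://host.example/bob/1": {"css:dark", "nav:bob", "footer:b"},
    "http://host.example/bob/2": {"css:dark", "nav:bob", "footer:b"},
}
for group in cluster_pages(features):
    print(sorted(group))
```

The point of the example is only that pages on one host can fall into separate logical sites, which is the blogger.com case mentioned above.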

Just my two cents. ;)
Stefan

On 28.05.2004 at 17:30, Joaquin Delgado wrote:

I've always been curious to see how traditional IR algorithms based on TF-IDF can be applied to search on the Web, which has a totally different topology than a flat document base. Because of the particular topology of the Web, algorithms such as Google's PageRank, based on link popularity, tend to return the most representative document WITHIN a site with anchors or content containing the keyword searched. Normally, when a company name is searched, this pinpoints the most referenced URL, typically the company home page, though it may not be the one that contains the most occurrences of the company's name (e.g. a search for "Toyota" yields "Toyota.com" at the top in Google). This also avoids getting too many hits from the same site just because the word is very common within the site. This problem becomes very obvious when you search for "Toyota" at mozdex.com. Apart from being ranked lower than expected (you have to go to the end of page 1 and then to page 2 to get documents from the main company's website in the US), there are many, many hits from the Toyota.com site (arguably one for each type of car they have ;-). This is because of the obviously high term frequency (i.e. Toyota occurs everywhere within Toyota.com).

Is it possible to create a ranking algorithm that could treat a site as a WHOLE, while still pinpointing the most relevant document within it based on the query terms? Has anyone considered things such as SITE-BASED TF and IDF? Maybe a good way to pinpoint the best document within the site would be to look at the internal topology (which the crawler knows), without having to compute an expensive overall PageRank calculation?
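To make the question concrete, here is a minimal sketch of one possible reading of site-based TF and IDF. Everything in it is an assumption for illustration: the host name stands in for the "site", term counts are summed over all pages of a site, IDF is computed over sites instead of documents, and the best page within a site is the matching page with the most internal in-links. The function names, the input format and the example URLs are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse


def tokenize(text):
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def site_of(url):
    """Assumption: the host name identifies the site."""
    return urlparse(url).netloc.lower()


def search(query, pages):
    """pages: list of dicts with 'url', 'text' and 'links' (outgoing URLs).

    Returns [(site, best_url, site_score)], one entry per matching site.
    """
    terms = tokenize(query)
    site_tf = defaultdict(Counter)  # site -> term counts over all its pages
    page_tf = {}                    # url -> term counts of that page
    inlinks = Counter()             # url -> number of links from the same site
    for p in pages:
        counts = Counter(tokenize(p["text"]))
        page_tf[p["url"]] = counts
        site_tf[site_of(p["url"])].update(counts)
        for target in p["links"]:
            if site_of(target) == site_of(p["url"]) and target != p["url"]:
                inlinks[target] += 1

    n_sites = len(site_tf)
    results = []
    for site, counts in site_tf.items():
        s = 0.0
        for t in terms:
            if counts[t]:
                df = sum(1 for c in site_tf.values() if c[t])  # sites with t
                s += math.log1p(counts[t]) * math.log(1 + n_sites / df)
        if s == 0.0:
            continue
        candidates = [u for u, c in page_tf.items()
                      if site_of(u) == site and any(c[t] for t in terms)]
        # Prefer internal in-links; break ties by query term count.
        best = max(candidates,
                   key=lambda u: (inlinks[u], sum(page_tf[u][t] for t in terms)))
        results.append((site, best, s))
    return sorted(results, key=lambda r: r[2], reverse=True)


pages = [
    {"url": "http://toyota.example/", "text": "Toyota home",
     "links": ["http://toyota.example/camry", "http://toyota.example/corolla"]},
    {"url": "http://toyota.example/camry", "text": "Toyota Camry Toyota Toyota",
     "links": ["http://toyota.example/"]},
    {"url": "http://toyota.example/corolla", "text": "Toyota Corolla Toyota",
     "links": ["http://toyota.example/"]},
    {"url": "http://news.example/cars", "text": "A review of a Toyota",
     "links": ["http://toyota.example/"]},
]
for site, url, s in search("Toyota", pages):
    print(site, url, round(s, 3))
```

In this toy data each site appears only once in the results, and the home page is chosen for the first site even though the Camry page has the highest term count, because the other pages of the site link to it.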
Just my two cents regarding relevance testing of Nutch.

__________________________

Joaquin Delgado, PhD.
Chief Technology Officer
TripleHop Technologies, Inc.
Office: (212) 243-4645, ext. 405
Cell: (646) 342-4880
45 West 25th Street, 9th floor (6th Ave.)
New York, NY 10010
www.TripleHop.com


---------------------------------------------------------------
open technology: http://www.media-style.com
open source: http://www.weta-group.net
open discussion: http://www.text-mining.org
