Hi
I believe this can be done by grouping results (pages) by site.
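A rough sketch of what grouping by site could look like, assuming the search already returns (url, score) pairs. The function name `group_by_site` and the `max_per_site` parameter are made up for illustration and are not part of Nutch:

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_site(hits, max_per_site=2):
    """Group (url, score) hits by host and keep the best few per site.

    Returns a list of (site, pages) tuples, ordered by the best score
    found in each site. `max_per_site` is a hypothetical cap.
    """
    groups = defaultdict(list)
    for url, score in hits:
        # The host name stands in for "site" here; this is an assumption.
        site = urlparse(url).netloc.lower()
        groups[site].append((url, score))

    ranked = []
    for site, pages in groups.items():
        pages.sort(key=lambda p: p[1], reverse=True)
        ranked.append((site, pages[:max_per_site]))

    # Order sites by their single best page.
    ranked.sort(key=lambda g: g[1][0][1], reverse=True)
    return ranked
```

With a cap of two pages per site, a result list dominated by one host collapses to that host's two best pages, followed by the other sites.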
Paul
From: "Joaquin Delgado" <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: <[EMAIL PROTECTED]>
CC: "'IT'" <[EMAIL PROTECTED]>
Subject: [Nutch-dev] Sites vs. Documents
Date: Fri, 28 May 2004 11:30:40 -0400
I've always been curious to see how traditional IR algorithms based on TF-IDF can be applied to search on the Web, which has a totally different topology than a flat document base. Because of the particular topology of the Web, algorithms such as Google's PageRank, based on link popularity, tend to return the most representative document WITHIN a site with anchors or content containing the keyword searched. Normally, when a company name is searched, this pinpoints the most referenced URL, typically the company home page, though it may not be the one that contains the most occurrences of the company's name (e.g. a search for "Toyota" yields "Toyota.com" at the top in Google). This also avoids getting too many hits from the same site just because the word is very common within the site.
This problem becomes very obvious when you search for "Toyota" at mozdex.com. Apart from ranking lower than expected (you have to go to the end of page 1 and then page 2 to get documents from the main company's website in the US), there are many, many hits from the Toyota.com site (arguably one for each type of car they have ;-). This is because of the obviously high term frequency (i.e. Toyota occurs everywhere within Toyota.com).
Is it possible to create a ranking algorithm that could treat a site as a WHOLE, while still pinpointing the most relevant document within it based on the query terms? Has anyone considered things such as SITE-BASED TF and IDF? Maybe a good way to pinpoint the best document within the site would be to look at its internal topology (which the crawler knows), without having to compute an expensive overall PageRank calculation?
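To make the question concrete, here is one possible reading of "site-based TF and IDF", as a sketch only and not how Nutch scores anything. It assumes all pages of a host are pooled into one bag of words, IDF is counted over sites instead of pages, TF is log-damped, and the representative page is the matching page with the most in-links from its own site. The names `rank_sites` and `site_of` and the smoothing constants are hypothetical:

```python
import math
from collections import Counter, defaultdict
from urllib.parse import urlparse


def site_of(url):
    # Assumption: the host name identifies the site.
    return urlparse(url).netloc.lower()


def rank_sites(pages, links, query_terms):
    """pages: {url: [tokens]}, links: [(src_url, dst_url)].

    Returns [(site, best_url, score)] sorted by score, highest first.
    """
    # Pool term counts per site, so a site acts as one large document.
    site_tf = defaultdict(Counter)
    for url, tokens in pages.items():
        site_tf[site_of(url)].update(tokens)

    # Count in-links that stay inside a site (the internal topology).
    inlinks = Counter()
    for src, dst in links:
        if src != dst and site_of(src) == site_of(dst):
            inlinks[dst] += 1

    n_sites = len(site_tf)

    def idf(term):
        # Document frequency is counted over sites, not pages.
        df = sum(1 for tf in site_tf.values() if tf[term] > 0)
        return math.log((1 + n_sites) / (1 + df)) + 1.0

    results = []
    for site, tf in site_tf.items():
        # Log damping keeps a term repeated on every page from dominating.
        score = sum((1 + math.log(tf[t])) * idf(t)
                    for t in query_terms if tf[t] > 0)
        if score == 0:
            continue
        # Among pages that mention a query term, pick the one the site
        # itself links to most often.
        candidates = [u for u, toks in pages.items()
                      if site_of(u) == site
                      and any(t in toks for t in query_terms)]
        best = max(candidates, key=lambda u: inlinks[u])
        results.append((site, best, score))

    results.sort(key=lambda r: r[2], reverse=True)
    return results
```

Each site then contributes a single hit, and the page chosen for it depends on internal links only, so no global link analysis is needed for this step.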
Just my two cents regarding relevance testing of NUTCH.
__________________________
Joaquin Delgado, PhD. Chief Technology Officer TripleHop Technologies, Inc. Office: (212) 243-4645, ext. 405 Cell: (646) 342-4880 45 West 25th Street, 9th floor (6th Ave.) New York, NY 10010 <http://www.triplehop.com/> www.TripleHop.com
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
