RE: [Nutch-dev] Sites vs. Documents

Fri, 28 May 2004 09:38:45 -0700

Hi

I believe this can be done by grouping the results (pages) by site.
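A minimal sketch of what grouping by site could look like, assuming the hits arrive as (url, score) pairs already sorted by descending score. The function name `group_by_site` and the `max_per_site` parameter are made up for illustration and are not part of Nutch:

```python
from urllib.parse import urlsplit


def group_by_site(hits, max_per_site=2):
    """Group ranked hits by host, keeping at most max_per_site per host.

    hits: list of (url, score) pairs, assumed sorted by descending score.
    Returns a list of (site, [(url, score), ...]) in order of each site's
    best-ranked hit.
    """
    groups = {}  # dicts preserve insertion order (Python 3.7+)
    for url, score in hits:
        # Hypothetical choice: the hostname stands in for "the site".
        site = urlsplit(url).hostname or url
        bucket = groups.setdefault(site, [])
        if len(bucket) < max_per_site:
            bucket.append((url, score))
    return list(groups.items())
```

With a cap of two per site, a third hit from the same host is dropped from the listing, so one site with a very common term cannot fill the first page on its own.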

Paul


From: "Joaquin Delgado" <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: <[EMAIL PROTECTED]>
CC: "'IT'" <[EMAIL PROTECTED]>
Subject: [Nutch-dev] Sites vs. Documents
Date: Fri, 28 May 2004 11:30:40 -0400
Organization: TripleHop Technologies Inc.

I've always been curious to see how traditional IR algorithms based on
TF-IDF can be applied to search on the Web, which has a totally different
topology from a flat document base. Because of the particular topology of
the Web, algorithms such as Google's PageRank, based on link popularity,
tend to return the most representative document WITHIN a site with anchors
or content containing the keyword searched. Normally, when a company name
is searched, this pinpoints the most referenced URL, typically the company
home page, though it may not be the one that contains the most occurrences
of the company's name (e.g. a search for "Toyota" yields "Toyota.com" at
the top in Google). This also avoids getting too many hits from the same
site just because the word is very common within the site. This problem
becomes very obvious when you search for "Toyota" at mozdex.com. Apart from
the site being ranked lower than expected (you have to go to the end of
page 1 and then page 2 to get documents from the company's main US
website), there are many, many hits from the Toyota.com site (arguably one
for each type of car they have ;-). This is because of the obviously high
term frequency (i.e. Toyota occurs everywhere within Toyota.com).
Is it possible to create a ranking algorithm that could treat a site as a
WHOLE, while still pinpointing the most relevant document within it based
on the query terms? Has anyone considered things such as SITE-BASED TF and
IDF? Maybe a good way to pinpoint the best document within the site is to
look at the internal topology (which the crawler knows), without having to
perform an expensive overall PageRank calculation?
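To make the site-based TF and IDF idea concrete, here is a rough sketch under several assumptions: each site is treated as one big document, IDF is computed over sites rather than pages, and the "best" page within a site is simply the one with the highest page-level term frequency (a stand-in for the internal-topology analysis, which is not modeled here). The function name `rank_sites` and the exact weighting are hypothetical and not how Nutch scores documents:

```python
import math
from collections import Counter
from urllib.parse import urlsplit


def rank_sites(pages, term):
    """Rank sites for a single query term, one entry per site.

    pages: dict mapping url -> list of tokens (assumed already tokenized).
    Returns [(site, site_score, best_url)] sorted by descending site_score.
    """
    site_tf = Counter()   # occurrences of the term across the whole site
    site_len = Counter()  # total tokens across the whole site
    best_page = {}        # site -> (page-level tf, url)
    for url, tokens in pages.items():
        site = urlsplit(url).hostname or url
        count = tokens.count(term)
        site_tf[site] += count
        site_len[site] += len(tokens)
        page_tf = count / len(tokens) if tokens else 0.0
        if site not in best_page or page_tf > best_page[site][0]:
            best_page[site] = (page_tf, url)

    matching = [s for s in site_len if site_tf[s] > 0]
    if not matching:
        return []
    # Site-level IDF: how rare the term is across sites, not across pages.
    idf = math.log(1 + len(site_len) / len(matching))
    ranked = [
        (s, (site_tf[s] / site_len[s]) * idf, best_page[s][1])
        for s in matching
    ]
    ranked.sort(key=lambda entry: entry[1], reverse=True)
    return ranked
```

Because every site contributes a single entry, a site where the term appears on every page shows up once, represented by its strongest page, instead of once per page.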
Just my two cents regarding relevance testing of NUTCH.
__________________________
Joaquin Delgado, PhD.
Chief Technology Officer
TripleHop Technologies, Inc.
Office: (212) 243-4645, ext. 405
Cell: (646) 342-4880
45 West 25th Street, 9th floor (6th Ave.)
New York, NY 10010
 <http://www.triplehop.com/> www.TripleHop.com
