On May 29, 2006, at 10:58 AM, Andrzej Bialecki wrote:

>> Has anyone used existing categorization data associated with the Reuters corpus to build a benchmarker that measured IR precision and/or recall?
>
> That would be RCV1 or RCV2, right? AFAIK the Reuters-21578 has no such information ... The use of RCV1/RCV2 is subject to a more stringent license than Reuters-21578, so that few people would be able to actually run the benchmarks.

Reuters-21578 does have categorization information. Here's a snippet from one of the SGML files (note the TOPICS tag):

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5562" NEWID="19">
<DATE>26-FEB-1987 15:26:54.12</DATE>
<TOPICS><D>wheat</D><D>grain</D></TOPICS>
<PLACES><D>yemen-arab-republic</D><D>usa</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>
&#5;&#5;&#5;C G
&#22;&#22;&#1;f0798&#31;reute
u f BC-/BONUS-WHEAT-FLOUR-FO   02-26 0096</UNKNOWN>
<TEXT>&#2;
<TITLE>BONUS WHEAT FLOUR FOR NORTH YEMEN  -- USDA</TITLE>

I'm not sure how to use this info, though -- I'm just investigating whether there's prior art before I start thinking hard about it.
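
One possible way to use the TOPICS labels, offered only as a rough sketch: treat a topic name such as "wheat" as the query, and treat every document tagged with that topic as relevant. That relevance rule is an assumption, not something the corpus documentation prescribes. The helper names below (topics_by_doc, precision_recall) are hypothetical, the regexes assume the tag layout shown in the snippet above, and the list of retrieved NEWIDs stands in for whatever the search engine under test returns.

```python
import re

# Assumes each document starts with a <REUTERS ... NEWID="n"> tag, as in the snippet.
DOC_RE = re.compile(r'<REUTERS[^>]*NEWID="(\d+)"[^>]*>(.*?)(?=<REUTERS|\Z)', re.S)
TOPICS_RE = re.compile(r'<TOPICS>(.*?)</TOPICS>', re.S)
D_RE = re.compile(r'<D>(.*?)</D>')

def topics_by_doc(sgml):
    """Map each NEWID to the set of labels found inside its TOPICS tag."""
    result = {}
    for newid, body in DOC_RE.findall(sgml):
        match = TOPICS_RE.search(body)
        result[int(newid)] = set(D_RE.findall(match.group(1))) if match else set()
    return result

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def relevant_ids(labels, topic):
    """Documents tagged with `topic` count as relevant (an assumed relevance rule)."""
    return {doc_id for doc_id, topics in labels.items() if topic in topics}
```

For example, with labels = topics_by_doc(sgml_text), calling precision_recall(hits_for_wheat, relevant_ids(labels, "wheat")) would score one query, where hits_for_wheat is the hypothetical list of NEWIDs the engine returned for "wheat". Averaging over all topics would give a single benchmark figure.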

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

