On May 29, 2006, at 10:58 AM, Andrzej Bialecki wrote:
>> Has anyone used existing categorization data associated with the
>> Reuters corpus to build a benchmarker that measured IR precision
>> and/or recall?
>
> That would be RCV1 or RCV2, right? AFAIK the Reuters-21578 has no
> such information ... The use of RCV1/RCV2 is subject to a more
> stringent license than Reuters-21578, so that few people would be
> able to actually run the benchmarks.

21578 has categorization information. Here's a snippet from one of
the SGML files (note the TOPICS tag):
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="5562" NEWID="19">
<DATE>26-FEB-1987 15:26:54.12</DATE>
<TOPICS><D>wheat</D><D>grain</D></TOPICS>
<PLACES><D>yemen-arab-republic</D><D>usa</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>
C G
f0798reute
u f BC-/BONUS-WHEAT-FLOUR-FO 02-26 0096</UNKNOWN>
<TEXT>
<TITLE>BONUS WHEAT FLOUR FOR NORTH YEMEN -- USDA</TITLE>
I'm not sure how to use this info, though -- I'm just investigating
whether there's prior art before I start thinking hard about it.
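
One possible starting point, as a rough Python sketch under stated assumptions rather than an existing benchmarker: treat each topic label as a relevance judgment, so the documents tagged "wheat" form the relevant set for a wheat query, then score a result list against that set. The regexes assume tags shaped like the snippet above; the function names, the trimmed sample with its closing </REUTERS> tag, and document 42 are hypothetical and only there for illustration.

```python
import re

# Trimmed copy of the record quoted above. The closing </REUTERS> tag is an
# assumption added so the sample forms one complete record.
SAMPLE = '''<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="5562" NEWID="19">
<TOPICS><D>wheat</D><D>grain</D></TOPICS>
<PLACES><D>yemen-arab-republic</D><D>usa</D></PLACES>
</REUTERS>'''

DOC_RE = re.compile(
    r'<REUTERS\b[^>]*\bNEWID="(\d+)"[^>]*>(.*?)</REUTERS>', re.DOTALL)
TOPICS_RE = re.compile(r'<TOPICS>(.*?)</TOPICS>', re.DOTALL)
D_RE = re.compile(r'<D>(.*?)</D>')


def topics_by_doc(sgml):
    """Map each NEWID to the set of topic labels inside its TOPICS tag."""
    result = {}
    for newid, body in DOC_RE.findall(sgml):
        match = TOPICS_RE.search(body)
        labels = set(D_RE.findall(match.group(1))) if match else set()
        result[int(newid)] = labels
    return result


def relevant_docs(topics, label):
    """Documents whose topic set contains the given label."""
    return {doc_id for doc_id, labels in topics.items() if label in labels}


def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


topics = topics_by_doc(SAMPLE)
relevant = relevant_docs(topics, "wheat")
# Pretend a search for "wheat" returned documents 19 and 42 (42 is made up).
precision, recall = precision_recall([19, 42], relevant)
```

This ignores ranking entirely. Anything rank-aware, such as precision at k or average precision, would need the ordered hit list instead of a set.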
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/