Re: Open Relevance Project?

Andrzej Bialecki Mon, 18 May 2009 03:00:32 -0700

André Warnier wrote:

Hi.
There has been an earlier suggestion here, later endorsed by someone else, to use the documentation of the Apache projects as a corpus. Being far from an expert, I am just naively wondering why the experts on this list seem to totally ignore it, without providing any argument. Is it somehow unsuitable, unpractical, inappropriate, bad, unfeasible, useless, uninteresting or ... ?

The documentation is mostly on a single topic - programming. The vocabulary is, let's not deceive ourselves, limited ;) Pages contain a lot of noise (Forrest navigation, javadoc dressing, common class names, code snippets, etc).

For a general-purpose corpus you would want to have several topics, with a well-balanced representation, a broad vocabulary and a low level of noise.

Additionally, this collection gets relatively little endorsement (links with meaningful anchors) from within apache.org, so the typical PageRank scoring wouldn't work too well (on the other hand, it resembles intranet linkage, so it could be useful for studying scoring algos for enterprise search).
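
To make the PageRank point concrete, here is a rough sketch in Python of plain power-iteration PageRank on a toy graph. The graph, the page names ("home", "news", "projects", "docs") and the damping value are all made up for illustration; this is not the real apache.org link structure, nor the scoring used by any particular engine. It only shows that a page with few inbound links ends up with the lowest score.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict {page: [outlinked pages]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical toy graph: the main site pages link to each other freely,
# but only one of them ("projects") links to the documentation.
toy_links = {
    "home": ["news", "projects"],
    "news": ["home", "projects"],
    "projects": ["home", "news", "docs"],
    "docs": ["home"],
}

scores = pagerank(toy_links)
for page, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{page:10s} {score:.3f}")
```

In this toy setup "docs" comes out last simply because it receives a third of one page's rank and nothing else, which is the situation I mean by "little endorsement".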


So, while this collection is not useless, it's not the best fit either.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
