André Warnier wrote:
> Hi.
> There has been an earlier suggestion here, later endorsed by someone
> else, to use the documentation of the Apache projects as a corpus.
> Being far from an expert, I am just naively wondering why the experts on
> this list seem to totally ignore it, without providing any argument.
> Is it somehow unsuitable, impractical, inappropriate, bad, infeasible,
> useless, uninteresting or ... ?

The documentation is mostly on a single topic: programming. The
vocabulary is, let's not deceive ourselves, limited ;) Pages contain a
lot of noise (Forrest navigation, Javadoc dressing, common class names,
code snippets, etc.).
For a general-purpose corpus you would want several topics in
well-balanced proportions, a broad vocabulary, and a low level of
noise.
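
One rough way to put a number on "broad vocabulary" is the type-token
ratio (distinct words divided by total words). The sketch below is only
an illustration: the tokenizer is deliberately naive, the two sample
strings are made up, and the ratio depends on text length, so it is
only meaningful when comparing samples of similar size.

```python
import re

def type_token_ratio(text):
    """Return distinct words / total words for a piece of text."""
    # Naive tokenizer (an assumption): lowercase runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Hypothetical samples, not taken from any real corpus.
narrow = "class method class object method class"
broad = "rivers carve valleys while markets shift slowly"

narrow_ttr = type_token_ratio(narrow)  # 3 distinct words out of 6
broad_ttr = type_token_ratio(broad)    # 7 distinct words out of 7
```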
Additionally, this collection gets relatively little endorsement (links
with meaningful anchors) from within apache.org, so typical PageRank
scoring wouldn't work too well. On the other hand, it resembles intranet
linkage, so it could be useful for studying scoring algorithms for
enterprise search.
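
To illustrate the point about missing in-links, here is a minimal
PageRank sketch using plain power iteration. The page names and link
graph are hypothetical, and the damping factor of 0.85 is just the
commonly quoted default, not something measured on apache.org.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical toy site: "docs" links out, but nothing links to it.
links = {
    "home": ["projects", "news"],
    "projects": ["home", "news"],
    "news": ["home", "projects"],
    "docs": ["home"],
}
ranks = pagerank(links)
lowest = min(ranks, key=ranks.get)
```

In this toy graph "docs" ends up with only the baseline share,
(1 - damping) / n, because no other page endorses it.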
So, while this collection is not useless, it's not the best fit either.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com