Steven Parkes wrote:
And what about Project Gutenberg?
Wikipedia is going to have relatively short text, Gutenberg very long.
Very long documents are useful for testing for anomalies, but they're
not so useful as retrieved documents, nor typical of applications. Very
long hits are awkward for users. Book search engines usually operate
best by breaking texts into small units (chapters, pages,
overlapping windows, etc.) and searching those rather than the entire
work, perhaps merging multiple hits from the same work in displayed
results. (See, e.g., California Digital Library's XTF system, built by
Kirk Hastings using Lucene. http://www.cdlib.org/inside/projects/xtf/)
I think Wikipedia is a much more typical use of Lucene.
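
To make the chunking idea concrete, here is a rough sketch in plain Python of splitting a long text into overlapping windows and then grouping chunk-level hits back under their parent work. It is not how XTF or Lucene does it; the function names, the (work_id, chunk_start, score) hit tuples, and the window sizes are all made up for illustration.

```python
from collections import defaultdict


def overlapping_windows(words, size=200, step=100):
    """Yield (start_index, chunk) pairs; consecutive chunks overlap by size - step words.

    Hypothetical helper: each chunk would be indexed as its own small document.
    """
    if not words:
        return
    last_start = max(len(words) - size, 0)
    start = 0
    while True:
        yield start, words[start:start + size]
        if start >= last_start:
            break  # this chunk already reaches the end of the text
        start += step


def merge_hits_by_work(hits):
    """Group (work_id, chunk_start, score) hits by work, best-scoring work first."""
    grouped = defaultdict(list)
    for work_id, chunk_start, score in hits:
        grouped[work_id].append((chunk_start, score))
    # Rank each work by the score of its best chunk (an assumed ranking rule).
    return sorted(
        grouped.items(),
        key=lambda item: max(score for _, score in item[1]),
        reverse=True,
    )
```

Each window would be indexed as a separate document carrying its work's id, so a query matches short units and the display layer shows one entry per work.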
Doug