Steven Parkes wrote:
And what about Project Gutenberg?

Wikipedia is going to have relatively short text, Gutenberg very long.

Very long documents are useful for testing for anomalies, but they're not very useful as retrieved documents, nor are they typical of applications. Very long hits are awkward for users. Book search engines usually work best by breaking texts into small units (chapters, pages, overlapping windows, etc.) and searching those rather than the entire work, perhaps merging multiple hits from the same work in the displayed results. (See, e.g., California Digital Library's XTF system, built by Kirk Hastings using Lucene. http://www.cdlib.org/inside/projects/xtf/)
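
To make the windowing idea concrete, here is a rough sketch in plain Python rather than Lucene. The function names, the window and step sizes, and the raw term-count scoring are all assumptions made up for illustration; this is not how XTF or Lucene actually does it. It splits each work into overlapping word windows, matches each window separately, and then merges the window hits so that each work appears only once in the results.

```python
from collections import defaultdict


def overlapping_windows(words, size=200, step=100):
    """Yield (start_offset, window) pairs of overlapping word windows."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    start = 0
    while True:
        yield start, words[start:start + size]
        if start + size >= len(words):
            break  # this window already reaches the end of the text
        start += step


def search_works(works, term, size=200, step=100):
    """Match `term` per window, then merge the hits per work.

    `works` maps a title to its full text (hypothetical input format).
    Returns (title, best_window_count, [(start, count), ...]) tuples,
    one per matching work, ordered by best window first.
    """
    term = term.lower()
    hits = defaultdict(list)
    for title, text in works.items():
        words = text.lower().split()
        for start, window in overlapping_windows(words, size, step):
            count = window.count(term)  # toy score: raw term count
            if count:
                hits[title].append((start, count))
    merged = [
        (title, max(count for _, count in wins), wins)
        for title, wins in hits.items()
    ]
    merged.sort(key=lambda row: row[1], reverse=True)
    return merged


if __name__ == "__main__":
    works = {
        "Work A": "whale " * 3 + "sea " * 7,
        "Work B": "sea " * 10,
    }
    for title, best, wins in search_works(works, "whale", size=4, step=2):
        print(title, best, wins)
```

Each work shows up once, with its matching windows listed underneath, so a user can jump to the relevant passage instead of being handed a whole book.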

I think Wikipedia is a much more typical use of Lucene.

Doug
