Matt Quail wrote:

Hi all,

I'm doing a presentation to my local JUG on Lucene, and I'm looking for a "good" set of documents to use as a demonstration.

Ideally it would be:
1) large (10,000 plus?).
2) contain some metadata besides "body" (like author, date, primarykey, etc).
3) freely available.


I was going to use the data from the previous Google programming contest, but it doesn't seem to be available.

If I can't find anything satisfactory, I'll probably:
- generate a fake whitepages phonebook
- grab documents from project Gutenberg

My preference is for some "real" data, but I'm happy to generate fake data if no-one has any better ideas.


How about http://dmoz.org/rdf, specifically content.rdf.u8.gz? Nutch includes a parser/converter for this format, but it's trivial to write one yourself, so long as you use SAX rather than loading the whole file into memory... (unless, of course, you run it on a Cray or something.. :-) )
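
To give an idea of what "do it yourself with SAX" could look like, here is a rough sketch, not the Nutch converter. It assumes each record in the dump is an ExternalPage element with an "about" attribute holding the URL and d:Title, d:Description and topic children; please check those names against the actual file. DmozSketch, Page and PageHandler are made-up names for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class DmozSketch {
    /** One record from the dump; each member could become a Lucene field. */
    static final class Page {
        String url = "", title = "", description = "", topic = "";
    }

    /** Collects pages as the parser streams through the file. */
    static final class PageHandler extends DefaultHandler {
        final List<Page> pages = new ArrayList<>();
        private Page current;                       // non-null while inside a record
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("ExternalPage".equals(qName)) {     // assumed record element name
                current = new Page();
                String about = atts.getValue("about");
                current.url = (about == null) ? "" : about;
            }
            text.setLength(0);                      // start collecting text for this element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);         // may be called several times per element
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current == null) return;            // outside a record, e.g. Topic elements
            String value = text.toString().trim();
            switch (qName) {                        // assumed child element names
                case "d:Title":       current.title = value; break;
                case "d:Description": current.description = value; break;
                case "topic":         current.topic = value; break;
                case "ExternalPage":  pages.add(current); current = null; break;
                default: break;
            }
        }
    }

    /** Parses a stream with the default (non-namespace-aware) SAX parser. */
    static List<Page> parse(InputStream in) throws Exception {
        PageHandler handler = new PageHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.pages;
    }

    public static void main(String[] args) throws Exception {
        // Tiny hand-written sample in the assumed format, not real DMOZ data.
        String xml = "<RDF xmlns:d=\"http://purl.org/dc/elements/1.0/\">"
                + "<ExternalPage about=\"http://example.org/\">"
                + "<d:Title>Example</d:Title>"
                + "<d:Description>A sample page</d:Description>"
                + "<topic>Top/Test</topic>"
                + "</ExternalPage></RDF>";
        for (Page p : parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))) {
            System.out.println(p.url + " | " + p.title + " | " + p.topic);
        }
    }
}
```

For the real file you would wrap a FileInputStream in java.util.zip.GZIPInputStream and pass that to parse(). With the full dump you would also want to hand each Page to the IndexWriter in endElement instead of collecting them in a list, otherwise you lose the memory advantage of SAX.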



-- Best regards, Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

