[ https://issues.apache.org/jira/browse/LUCENE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482367 ]
Karl Wettin commented on LUCENE-836:
------------------------------------

Regarding data and user queries, I have a 150 000 document corpus with
4 000 000 queries that I might be able to convince the owners to release.
It is great data, but a bit politically incorrect (torrents).

There is some simple Wikipedia harvesting in LUCENE-826, and I'm in the
middle of rewriting it into a more general Wikipedia library for text mining
purposes. Perhaps you have some ideas you want to put in there? I plan
something like this:

public class WikipediaCorpus {
  Map<String, String> wikipediaDomainPrefixByLanguageISO;
  Map<URL, WikipediaArticle> harvestedArticle;

  public WikipediaArticle getArticle(String languageISO, String title) { .. }
}

public class WikipediaArticle {
  WikipediaArticle(URL url) { .. }

  String languageISO;
  String title;
  String[] contentParagraphs;
  Date[] modified;
  Map<String, String> articleInOtherLanguagesByLanguageISO;
}

> Benchmarks Enhancements (precision/recall, TREC, Wikipedia)
> -----------------------------------------------------------
>
>                 Key: LUCENE-836
>                 URL: https://issues.apache.org/jira/browse/LUCENE-836
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Other
>            Reporter: Grant Ingersoll
>            Priority: Minor
>
> Would be great if the benchmark contrib had a way of providing
> precision/recall benchmark information ala TREC. I don't know what the
> copyright issues are for the TREC queries/data (I think the queries are
> available, but not sure about the data), so not sure if this is even
> feasible, but I could imagine we could at least incorporate support for it
> for those who have access to the data. It has been a long time since I have
> participated in TREC, so perhaps someone more familiar w/ the latest can
> fill in the blanks here.
> Another option is to ask for volunteers to create queries and make judgments
> for the Reuters data, but that is a bit more complex and probably not
> necessary. Even so, an Apache licensed set of benchmarks may be useful for
> the community as a whole. Hmmm....
> Wikipedia might be another option instead of Reuters to set up as a download
> for benchmarking, as it is quite large and I believe the licensing terms are
> quite amenable. Having a larger collection would be good for stressing
> Lucene more and would give many users a demonstration of how Lucene handles
> large collections.
> At any rate, this kind of information could be useful for people looking at
> different indexing schemes, formats, payloads and different query strategies.
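
For reference, the precision/recall figures the issue asks for could be computed along these lines. This is only a minimal sketch under stated assumptions: the class name PrecisionRecall and the document ids are hypothetical and are not part of Lucene or the benchmark contrib, relevance judgments are assumed to be binary, and each document id is assumed to appear at most once in the result list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration only; not an existing Lucene class. */
public class PrecisionRecall {

    /** Precision: fraction of the retrieved documents that are judged relevant. */
    public static double precision(List<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        return (double) countHits(retrieved, relevant) / retrieved.size();
    }

    /** Recall: fraction of the judged-relevant documents that were retrieved. */
    public static double recall(List<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        return (double) countHits(retrieved, relevant) / relevant.size();
    }

    /** Counts retrieved ids that are in the relevant set (ids assumed unique). */
    private static int countHits(List<String> retrieved, Set<String> relevant) {
        int hits = 0;
        for (String docId : retrieved) {
            if (relevant.contains(docId)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up ids: four results for one query, three documents judged relevant.
        List<String> retrieved = Arrays.asList("d1", "d2", "d3", "d4");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3", "d7"));

        System.out.println("precision = " + precision(retrieved, relevant));
        System.out.println("recall    = " + recall(retrieved, relevant));
    }
}
```

With the made-up data above, two of the four retrieved documents are relevant, so precision is 2/4 and recall is 2/3. A TREC-style run would average such figures over many queries, each with its own judgment set.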