I think the Reuters corpus is pretty good and it is pretty well known in the community. Probably the most important part would be to build up a set of judgments. I don't think it is too hard to come up w/ 50-100 questions/queries, but creating the relevance pool will be more difficult. I suppose we could set up a social networking site to harvest judgments... :-)
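
To make concrete what the judgments would be used for, here is a minimal sketch of precision and recall at a cutoff k for a single query. It is not part of the benchmark contrib; the class and method names are hypothetical, and it assumes document ids are plain strings, the ranked list has no duplicates, and the judgments are a flat set of relevant ids per query.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper: precision/recall for one query from a ranked list and a set of judged-relevant ids. */
final class PrecisionRecall {

    /** Fraction of the top-k ranks holding a relevant document; missing ranks count as non-relevant. */
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        return (double) hits(ranked, relevant, k) / k;
    }

    /** Fraction of all judged-relevant documents that appear in the top-k ranks. */
    static double recallAtK(List<String> ranked, Set<String> relevant, int k) {
        if (relevant.isEmpty()) {
            return 0.0; // no judgments for this query, so recall is reported as 0 here
        }
        return (double) hits(ranked, relevant, k) / relevant.size();
    }

    /** Counts relevant documents among the first k results (or fewer if the list is shorter). */
    private static int hits(List<String> ranked, Set<String> relevant, int k) {
        int cutoff = Math.min(k, ranked.size());
        int count = 0;
        for (int i = 0; i < cutoff; i++) {
            if (relevant.contains(ranked.get(i))) {
                count++;
            }
        }
        return count;
    }
}
```

Averaging these numbers over the 50-100 queries would give one figure per run, which is why the size and quality of the relevance pool matters more than the queries themselves.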

The 4M queries would be good for load testing.

Wikipedia stuff is good, but you need to be able to handle/remove the redirects, otherwise you have a tendency to get redirect pages as your top matches due to length normalization. Plus it is really big to download.
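
A rough sketch of the redirect check, assuming you have the raw wikitext of each page in hand before indexing. The class name is hypothetical, and it only looks for the English "#REDIRECT" keyword at the start of the page; other language editions may use localized keywords that this does not cover.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: detects MediaWiki redirect pages so they can be skipped at indexing time. */
final class WikipediaRedirectFilter {

    // A redirect page's wikitext begins with "#REDIRECT" (any case), followed by a [[Target]] link.
    private static final Pattern REDIRECT =
            Pattern.compile("^\\s*#REDIRECT\\b", Pattern.CASE_INSENSITIVE);

    /** Returns true if the wikitext starts with a redirect directive. */
    static boolean isRedirect(String wikitext) {
        return wikitext != null && REDIRECT.matcher(wikitext).find();
    }
}
```

Skipping pages where isRedirect returns true keeps these very short documents out of the index, so they cannot float to the top through length normalization.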


On Mar 20, 2007, at 6:58 AM, Karl Wettin (JIRA) wrote:


[ https://issues.apache.org/jira/browse/LUCENE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482367 ]

Karl Wettin commented on LUCENE-836:
------------------------------------

Regarding data and user queries, I have a 150 000 document corpus with 4 000 000 queries that I might be able to convince the owners to release. It is great data, but a bit politically incorrect (torrents).

There is some simple Wikipedia harvesting in LUCENE-826, and I'm in the middle of rewriting it to a more general Wikipedia library for text mining purposes. Perhaps you have some ideas you want to put in there? I plan something like this:

public class WikipediaCorpus {
  Map<String, String> wikipediaDomainPrefixByLanguageISO;
  Map<URL, WikipediaArticle> harvestedArticle;

  public WikipediaArticle getArticle(String languageISO, String title) {
    ..
  }
}

public class WikipediaArticle {
  WikipediaArticle(URL url) {
    ..
  }

  String languageISO;
  String title;
  String[] contentParagraphs;

  Date[] modified;

  Map<String, String> articleInOtherLanguagesByLanguageISO;
}



Benchmarks Enhancements (precision/recall, TREC, Wikipedia)
-----------------------------------------------------------

                Key: LUCENE-836
                URL: https://issues.apache.org/jira/browse/LUCENE-836
            Project: Lucene - Java
         Issue Type: New Feature
         Components: Other
           Reporter: Grant Ingersoll
           Priority: Minor

Would be great if the benchmark contrib had a way of providing precision/recall benchmark information a la TREC. I don't know what the copyright issues are for the TREC queries/data (I think the queries are available, but not sure about the data), so not sure if this is even feasible, but I could imagine we could at least incorporate support for it for those who have access to the data. It has been a long time since I have participated in TREC, so perhaps someone more familiar w/ the latest can fill in the blanks here. Another option is to ask for volunteers to create queries and make judgments for the Reuters data, but that is a bit more complex and probably not necessary. Even so, an Apache licensed set of benchmarks may be useful for the community as a whole. Hmmm.... Wikipedia might be another option instead of Reuters to set up as a download for benchmarking, as it is quite large and I believe the licensing terms are quite amenable. Having a larger collection would be good for stressing Lucene more and would give many users a demonstration of how Lucene handles large collections. At any rate, this kind of information could be useful for people looking at different indexing schemes, formats, payloads and different query strategies.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/


