I think the Reuters corpus is pretty good and it's pretty well known in
the community. Probably the most important part would be to build up
a set of judgments. I don't think it is too hard to come up w/
50-100 questions/queries, but creating the relevance pool will be
more difficult. I suppose we could set up a social networking site to
harvest judgments... :-)
The 4M queries would be good for load testing.
Wikipedia stuff is good, but you need to be able to handle/remove the
redirects, otherwise you have a tendency to get redirect pages as
your top matches due to length normalization. Plus it is really big
to download.
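For what it's worth, filtering redirects at harvest time is pretty mechanical, since redirect pages start their wikitext with "#REDIRECT". A minimal sketch (the class name is made up, but the "#REDIRECT" convention is Wikipedia's own):

```java
// Sketch: detect Wikipedia redirect stubs so they can be skipped
// before indexing, instead of letting length normalization float
// these near-empty pages to the top of the results.
public class RedirectFilter {

    // Redirect pages begin their wikitext with "#REDIRECT"
    // (case-insensitive), e.g. "#REDIRECT [[Apache Lucene]]".
    public static boolean isRedirect(String wikitext) {
        if (wikitext == null) {
            return false;
        }
        return wikitext.trim().toUpperCase().startsWith("#REDIRECT");
    }
}
```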
On Mar 20, 2007, at 6:58 AM, Karl Wettin (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCENE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482367 ]
Karl Wettin commented on LUCENE-836:
------------------------------------
Regarding data and user queries, I have a 150 000 document corpus
with 4 000 000 queries that I might be able to convince the owners
to release. It is great data, but a bit politically incorrect
(torrents).
There is some simple Wikipedia harvesting in LUCENE-826, and I'm in
the middle of rewriting it to a more general Wikipedia library for
text mining purposes. Perhaps you have some ideas you want to put
in there? I plan something like this:
public class WikipediaCorpus {

  Map<String, String> wikipediaDomainPrefixByLanguageISO;
  Map<URL, WikipediaArticle> harvestedArticle;

  public WikipediaArticle getArticle(String languageISO, String title) {
    ..
  }
}

public class WikipediaArticle {

  WikipediaArticle(URL url) {
    ..
  }

  String languageISO;
  String title;
  String[] contentParagraphs;
  Date[] modified;
  Map<String, String> articleInOtherLanguagesByLanguageISO;
}
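To give a feel for the intended call pattern, usage might look something like this (pseudocode against the sketch above; none of it is implemented yet):

```
// Hypothetical usage of the sketched API above.
WikipediaCorpus corpus = new WikipediaCorpus();

// Fetch the English article titled "Lucene" and walk its paragraphs.
WikipediaArticle article = corpus.getArticle("en", "Lucene");
for (String paragraph : article.contentParagraphs) {
  // feed each paragraph to the text mining pipeline
}

// Follow the interlanguage links, e.g. to the Swedish version.
String svTitle = article.articleInOtherLanguagesByLanguageISO.get("sv");
```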
Benchmarks Enhancements (precision/recall, TREC, Wikipedia)
-----------------------------------------------------------
Key: LUCENE-836
URL: https://issues.apache.org/jira/browse/LUCENE-836
Project: Lucene - Java
Issue Type: New Feature
Components: Other
Reporter: Grant Ingersoll
Priority: Minor
Would be great if the benchmark contrib had a way of providing
precision/recall benchmark information a la TREC. I don't know
what the copyright issues are for the TREC queries/data (I think
the queries are available, but not sure about the data), so I'm not
sure if this is even feasible, but I could imagine we could at
least incorporate support for it for those who have access to the
data. It has been a long time since I have participated in TREC,
so perhaps someone more familiar w/ the latest can fill in the
blanks here.
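To make concrete what such a benchmark would report: given a set of relevance judgments per query, precision and recall fall out of simple set arithmetic. A minimal sketch (the class and data structures here are assumptions for illustration, not anything in contrib/benchmark):

```java
import java.util.List;
import java.util.Set;

// Sketch of TREC-style scoring: "relevant" is the set of document ids
// judged relevant for a query, "retrieved" is the ranked list the
// engine returned for that query.
public class PrecisionRecall {

    // Fraction of retrieved documents that are judged relevant.
    public static double precision(List<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0;
        }
        return (double) hits(retrieved, relevant) / retrieved.size();
    }

    // Fraction of judged-relevant documents that were retrieved.
    public static double recall(List<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) hits(retrieved, relevant) / relevant.size();
    }

    private static int hits(List<String> retrieved, Set<String> relevant) {
        int hits = 0;
        for (String doc : retrieved) {
            if (relevant.contains(doc)) {
                hits++;
            }
        }
        return hits;
    }
}
```

The judgment pool mentioned above is exactly the source of the "relevant" sets: pool the top results from several systems, judge them, and score each run against the pooled judgments.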
Another option is to ask for volunteers to create queries and make
judgments for the Reuters data, but that is a bit more complex and
probably not necessary. Even so, an Apache licensed set of
benchmarks may be useful for the community as a whole. Hmmm....
Wikipedia might be another option instead of Reuters to set up as a
download for benchmarking, as it is quite large and I believe the
licensing terms are quite amenable. Having a larger collection
would be good for stressing Lucene more and would give many users
a demonstration of how Lucene handles large collections.
At any rate, this kind of information could be useful for people
looking at different indexing schemes, formats, payloads and
different query strategies.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/