[ 
https://issues.apache.org/jira/browse/LUCENE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482367
 ] 

Karl Wettin commented on LUCENE-836:
------------------------------------

Regarding data and user queries, I have a 150,000-document corpus with
4,000,000 queries that I might be able to convince the owners to release. It is
great data, but a bit politically incorrect (torrents).

There is some simple Wikipedia harvesting in LUCENE-826, and I'm in the middle
of rewriting it into a more general Wikipedia library for text mining purposes.
Perhaps you have some ideas you want to put in there? I plan something like
this:

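import java.net.URL;
import java.util.Date;
import java.util.Map;

public class WikipediaCorpus {
  // Wikipedia domain prefix, keyed by language ISO code
  Map<String, String> wikipediaDomainPrefixByLanguageISO;
  // articles harvested so far, keyed by their URL
  Map<URL, WikipediaArticle> harvestedArticle;

  public WikipediaArticle getArticle(String languageISO, String title) {
    // ..
    throw new UnsupportedOperationException("not implemented in this sketch");
  }
}

public class WikipediaArticle {
  WikipediaArticle(URL url) {
    // ..
  }

  String languageISO;
  String title;
  String[] contentParagraphs;

  // modification dates of the article
  Date[] modified;

  // this article in other languages, keyed by language ISO code
  Map<String, String> articleInOtherLanguagesByLanguageISO;
}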
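
To make the role of the prefix map a bit more concrete, here is a minimal
sketch of how a language code and a title might be turned into an article
address. The class name WikipediaUrlSketch, the method articleUri, the map
contents and the https://<prefix>.wikipedia.org/wiki/<title> layout are all
assumptions for illustration, not part of the planned library.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

public class WikipediaUrlSketch {
  // Hypothetical contents: language ISO code -> Wikipedia domain prefix.
  static final Map<String, String> wikipediaDomainPrefixByLanguageISO = new HashMap<>();
  static {
    wikipediaDomainPrefixByLanguageISO.put("en", "en");
    wikipediaDomainPrefixByLanguageISO.put("sv", "sv");
  }

  // Builds an article address, assuming the <prefix>.wikipedia.org/wiki/<title> layout.
  static URI articleUri(String languageISO, String title) {
    String prefix = wikipediaDomainPrefixByLanguageISO.get(languageISO);
    if (prefix == null) {
      throw new IllegalArgumentException("Unknown language: " + languageISO);
    }
    try {
      // Spaces become underscores; the multi-argument URI constructor
      // percent-encodes any remaining illegal characters in the path.
      return new URI("https", prefix + ".wikipedia.org",
          "/wiki/" + title.replace(' ', '_'), null);
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(articleUri("en", "Information retrieval"));
  }
}
```

A getArticle(languageISO, title) along these lines could first build the
address and then look it up in the harvested-article map before fetching
anything.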



> Benchmarks Enhancements (precision/recall, TREC, Wikipedia)
> -----------------------------------------------------------
>
>                 Key: LUCENE-836
>                 URL: https://issues.apache.org/jira/browse/LUCENE-836
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Other
>            Reporter: Grant Ingersoll
>            Priority: Minor
>
> It would be great if the benchmark contrib had a way of providing 
> precision/recall benchmark information à la TREC.  I don't know what the 
> copyright issues are for the TREC queries/data (I think the queries are 
> available, but not sure about the data), so not sure if this is even feasible, 
> but I could imagine we could at least incorporate support for it for those 
> who have access to the data.  It has been a long time since I have 
> participated in TREC, so perhaps someone more familiar w/ the latest can fill 
> in the blanks here.
> Another option is to ask for volunteers to create queries and make judgments 
> for the Reuters data, but that is a bit more complex and probably not 
> necessary.  Even so, an Apache licensed set of benchmarks may be useful for 
> the community as a whole.  Hmmm.... 
> Wikipedia might be another option instead of Reuters to set up as a download 
> for benchmarking, as it is quite large and I believe the licensing terms are 
> quite amenable.  Having a larger collection would be good for stressing 
> Lucene more and would give many users a demonstration of how Lucene handles 
> large collections.
> At any rate, this kind of information could be useful for people looking at 
> different indexing schemes, formats, payloads and different query strategies.
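
For reference, the precision/recall figures mentioned in the issue description
can be sketched for a single query as below. This is only an illustration of
the standard set-based definitions (precision = hits / retrieved, recall =
hits / relevant); the class name PrecisionRecallSketch, its methods and the
document ids are hypothetical and not part of the benchmark contrib.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PrecisionRecallSketch {
  // Number of retrieved documents that were judged relevant.
  static int hits(Set<String> retrieved, Set<String> relevant) {
    Set<String> both = new HashSet<>(retrieved);
    both.retainAll(relevant);
    return both.size();
  }

  // Fraction of retrieved documents that are relevant (0 if nothing was retrieved).
  static double precision(Set<String> retrieved, Set<String> relevant) {
    if (retrieved.isEmpty()) return 0.0;
    return (double) hits(retrieved, relevant) / retrieved.size();
  }

  // Fraction of relevant documents that were retrieved (0 if nothing is relevant).
  static double recall(Set<String> retrieved, Set<String> relevant) {
    if (relevant.isEmpty()) return 0.0;
    return (double) hits(retrieved, relevant) / relevant.size();
  }

  public static void main(String[] args) {
    // Made-up document ids: what a search returned vs. what the judges marked relevant.
    Set<String> retrieved = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d4"));
    Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3", "d7"));
    System.out.println("precision = " + precision(retrieved, relevant));
    System.out.println("recall    = " + recall(retrieved, relevant));
  }
}
```

With these made-up ids, two of the four retrieved documents are relevant, and
two of the three relevant documents were retrieved.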
