You're right. Shows you my mind's been on performance, rather than scoring.
My interpretation of the license was that it was pretty broad and we could host a fixed copy if we wanted to. But I'm not a lawyer, so ...

-----Original Message-----
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 24, 2007 3:54 PM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Well, there would be issues if we tried to do precision/recall/quality of results benchmarks on it, I think. I can ask on legal-discuss.

This is the license I found: http://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GFDL
found via http://en.wikipedia.org/wiki/Wikipedia:Database_download

On Apr 24, 2007, at 6:01 PM, Steven Parkes wrote:

> They don't seem to keep things around too long. There were more files
> available when I downloaded earlier this month, but they're already
> gone.
>
> Wikipedia is supposed to only contain stuff covered by the GNU Free
> Documentation License so saving it should be okay. In fact, one of the
> other files you can download has all the revisions of all the
> documents.
>
> The issue of different versions is a good one. I wonder how much it
> matters for reasonably big datasets. Not that much of the data changes,
> I suspect.
>
> For grins, I think I'll download the newer snapshot and see if there's
> any difference for the ingest tests I've done.
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, April 24, 2007 1:50 PM
> To: java-dev@lucene.apache.org
> Subject: Re: [jira] Commented: (LUCENE-848) Add supported for Wikipedia
> English as a corpus in the benchmarker stuff
>
> Is there a way to pick a specific day, versus "latest". How long
> does Wikipedia archive? Always using the latest makes comparisons
> more difficult. I wonder if licensing terms would allow us to host a
> specific date of the version on Lucene zones. Of course, that may
> not be a good idea bandwidth wise.
> I'm open to suggestions. Maybe using the latest isn't that big of
> a deal.
>
> On Apr 24, 2007, at 2:45 PM, Steven Parkes (JIRA) wrote:
>
>> [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491396 ]
>>
>> Steven Parkes commented on LUCENE-848:
>> --------------------------------------
>>
>> Yeah, it takes a while to download.
>>
>> I added the jars since that's what we've been doing elsewhere. In
>> fact, xerces is in gdata-server too. Personally, the size isn't an
>> issue for me; don't know about others. What might be difficult,
>> though, is trying to share the two since that would mean
>> coordinating contrib projects, and I don't know anything about the
>> gdata server. I can tell you that if you want to support both 1.4
>> and 1.5 on something as big as Wikipedia, there is sensitivity to the
>> xerces revision.
>>
>> Sorry about the download problem, Grant. I actually documented that
>> in a readme ... that I can no longer find. I would swear I put it in
>> the patch but obviously I didn't because it's not there. Now I have
>> to go find it.
>>
>> The short answer is you want to download
>> http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-articles.xml.bz2.
>> The Wikipedia download site isn't always clean, doesn't have files
>> where they "should" be. It was when I first started this, but isn't now.
>>
>>> Add supported for Wikipedia English as a corpus in the benchmarker
>>> stuff
>>> -----------------------------------------------------------------------
>>>
>>> Key: LUCENE-848
>>> URL: https://issues.apache.org/jira/browse/LUCENE-848
>>> Project: Lucene - Java
>>> Issue Type: New Feature
>>> Components: contrib/benchmark
>>> Reporter: Steven Parkes
>>> Assigned To: Grant Ingersoll
>>> Priority: Minor
>>> Fix For: 2.2
>>>
>>> Attachments: LUCENE-848.txt, LUCENE-848.txt,
>>> LUCENE-848.txt, LUCENE-848.txt, WikipediaHarvester.java,
>>> xerces.jar, xerces.jar, xml-apis.jar
>>>
>>> Add support for using Wikipedia for benchmarking.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ