RE: [jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Steven Parkes Tue, 24 Apr 2007 16:16:33 -0700

You're right. Shows you my mind's been on performance, rather than
scoring.


My interpretation of the license was that it was pretty broad and we
could host a fixed copy if we wanted to. But I'm not a lawyer, so ... 

-----Original Message-----
From: Grant Ingersoll [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 24, 2007 3:54 PM
To: [email protected]
Subject: Re: [jira] Commented: (LUCENE-848) Add supported for Wikipedia
English as a corpus in the benchmarker stuff

Well, there would be issues if we tried to do precision/recall/quality
of results benchmarks on it, I think.  I can ask on legal-discuss.
This is the license I found:
http://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GFDL
found via http://en.wikipedia.org/wiki/Wikipedia:Database_download



On Apr 24, 2007, at 6:01 PM, Steven Parkes wrote:

> They don't seem to keep things around too long. There were more files
> available when I downloaded earlier this month, but they're already
> gone.
>
> Wikipedia is supposed to only contain stuff covered by the GNU Free
> Documentation License so saving it should be okay. In fact, one of the
> other files you can download has all the revisions of all the
> documents.
>
> The issue of different versions is a good one. I wonder how much it
> matters for reasonably big datasets. Not that much of the data
> changes, I suspect.
>
> For grins, I think I'll download the newer snapshot and see if there's
> any difference for the ingest tests I've done.
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, April 24, 2007 1:50 PM
> To: [email protected]
> Subject: Re: [jira] Commented: (LUCENE-848) Add supported for Wikipedia
> English as a corpus in the benchmarker stuff
>
> Is there a way to pick a specific day, versus "latest"?  How long
> does Wikipedia archive?  Always using the latest makes comparisons
> more difficult.  I wonder if licensing terms would allow us to host a
> version from a specific date on Lucene zones.  Of course, that may
> not be a good idea bandwidth-wise.  I'm open to suggestions.  Maybe
> using the latest isn't that big of a deal.
>
>
>
> On Apr 24, 2007, at 2:45 PM, Steven Parkes (JIRA) wrote:
>
>>
>>     [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491396 ]
>>
>> Steven Parkes commented on LUCENE-848:
>> --------------------------------------
>>
>> Yeah, it takes a while to download.
>>
>> I added the jars since that's what we've been doing elsewhere. In
>> fact, xerces is in gdata-server too. Personally, the size isn't an
>> issue for me; don't know about others.  What might be difficult,
>> though, is trying to share the two since that would mean
>> coordinating contrib projects, and I don't know anything about the
>> gdata server. I can tell you that if you want to support both 1.4
>> and 1.5 on something as big as Wikipedia, there is sensitivity to the
>> xerces revision.
>>
>> Sorry about the download problem, Grant. I actually documented that
>> in a readme ... that I can no longer find. I would swear I put it in
>> the patch but obviously I didn't because it's not there. Now I have
>> to go find it.
>>
>> The short answer is you want to download
>> http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-articles.xml.bz2.
>> The wikipedia download site isn't always clean and doesn't always have
>> files where they "should" be. It was when I first started this, but
>> isn't now.
>>
>>> Add supported for Wikipedia English as a corpus in the benchmarker
>>> stuff
>>> ------------------------------------------------------------------------
>>>
>>>                 Key: LUCENE-848
>>>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>>>             Project: Lucene - Java
>>>          Issue Type: New Feature
>>>          Components: contrib/benchmark
>>>            Reporter: Steven Parkes
>>>         Assigned To: Grant Ingersoll
>>>            Priority: Minor
>>>             Fix For: 2.2
>>>
>>>         Attachments: LUCENE-848.txt, LUCENE-848.txt,
>>> LUCENE-848.txt, LUCENE-848.txt, WikipediaHarvester.java,
>>> xerces.jar, xerces.jar, xml-apis.jar
>>>
>>>
>>> Add support for using Wikipedia for benchmarking.
>>
>> -- 
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
>
>
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ


