And what about Project Gutenberg?

Wikipedia is going to have relatively short texts, Gutenberg very long ones.

-----Original Message-----
From: Steven Parkes [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 23, 2007 2:37 PM
To: java-dev@lucene.apache.org
Subject: RE: [jira] Commented: (LUCENE-845) If you "flush by RAM usage"
then IndexWriter may over-merge

Well, since I want to look at the impact of merge policy, I'll look into
this.

Wikipedia is easy to download (bandwidth notwithstanding). The bz2'd dump
of the current English pages is 2.1G. That's certainly a lot of data. It
looks like the English is about 1.8M docs. All languages together come to
something like 21M now.

I was also thinking of the TREC data but that seems hard to come by?

-----Original Message-----
From: Grant Ingersoll [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 23, 2007 1:09 PM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-845) If you "flush by RAM usage"
then IndexWriter may over-merge

Yeah, I haven't played with millions of documents yet.  We will need a
bigger test collection, I think!  Although the benchmarker can add as
many as you want from the same source, index compression will affect
the results, possibly more than a bigger collection with all unique docs.

Maybe it is time to look at adding Wikipedia as a test collection.  I  
think there are something like 18+ million docs in it.

On Mar 23, 2007, at 4:01 PM, Doug Cutting wrote:

> Michael McCandless wrote:
>> Also, one caveat: whenever #docs (21578 for Reuters) divided by
>> maxBuffered docs is less than mergeFactor, you will have no merges
>> take place during your runs.  This greatly skews the results.
>
> Also, my guess is that this index fits entirely in the buffer  
> cache. Things behave quite differently when segments are larger  
> than available memory and merging requires lots of disk i/o.
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
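
To make the caveat quoted above concrete, here is a minimal sketch of the
arithmetic in Python. It only encodes the rule of thumb as stated in the
quote (no merges when #docs / maxBufferedDocs is less than mergeFactor); it
is not Lucene's actual merge logic. The function name and the
maxBufferedDocs / mergeFactor values are illustrative assumptions; only the
21578 Reuters doc count comes from the thread.

```python
REUTERS_DOCS = 21578  # doc count quoted in the thread


def merges_expected(num_docs, max_buffered_docs, merge_factor):
    """Hypothetical helper: True if the quoted rule of thumb predicts merges.

    The rule: if num_docs / max_buffered_docs is less than merge_factor,
    no merges take place during the run.
    """
    if max_buffered_docs <= 0 or merge_factor <= 0:
        raise ValueError("max_buffered_docs and merge_factor must be positive")
    # Rough number of segments flushed over the whole run.
    flushed_segments = num_docs / max_buffered_docs
    return flushed_segments >= merge_factor


# Illustrative settings only; these are assumptions, not benchmark configs.
for max_buffered in (10, 1000, 10000):
    print(max_buffered, merges_expected(REUTERS_DOCS, max_buffered, 10))
```

With a merge factor of 10, buffering 10000 docs gives only about 2 flushed
segments for Reuters, so under this rule such a run would see no merges at
all, which is the skew being warned about.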

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ


