Re: instantiated contrib

Karl Wettin Fri, 27 Aug 2010 01:48:15 -0700

Can you tell us what your queries are? Are they simple term queries, phrases, fuzzy, etc.?

I think the bad guy here is the term "hotel", since so many documents contain it. You could try loading the full index into II, see how long it takes to match just that term, and compare with RAMDirectory. Then try a more unique term.
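The comparison suggested above could be sketched as a small timing harness. This is plain Java with no Lucene dependency, and it is only a sketch: `TimingSketch`, `totalNanos` and the two placeholder tasks are hypothetical names, and each placeholder would be replaced by a real search for one specific term against one specific searcher.

```java
public class TimingSketch {

    /**
     * Runs the task `warmup` times untimed, then `runs` times timed,
     * and returns the total elapsed nanoseconds of the timed runs.
     */
    public static long totalNanos(Runnable task, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            task.run(); // let the JIT settle before measuring
        }
        long total = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            total += System.nanoTime() - start;
        }
        return total;
    }

    public static void main(String[] args) {
        // Placeholders (assumption): swap in a search for a very frequent
        // term and a search for a rare term against the same searcher.
        Runnable frequentTerm = () -> Math.sqrt(System.nanoTime());
        Runnable rareTerm = () -> Math.sqrt(System.nanoTime());

        long frequent = totalNanos(frequentTerm, 100, 1000);
        long rare = totalNanos(rareTerm, 100, 1000);
        System.out.println("frequent term: " + frequent + " ns");
        System.out.println("rare term:     " + rare + " ns");
    }
}
```

Running the same pair of tasks once against the RAMDirectory searcher and once against the II searcher should show whether the frequent term alone accounts for most of the time.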

If this indeed is the bad guy, here are two solutions that might work depending on your needs:

Replace all queries for "hotel" with a filter and remove "hotel" from your term queries. This might decrease p/r for phrases etc.

Index and query using shingles to create more unique terms. This will also improve p/r for phrases if you don't already do phrase queries, and if you do, it should save you time at query time, as you might no longer need the phrase queries. But the index size will grow.
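To illustrate what shingles do to the terms, here is a minimal sketch in plain Java. It is not how Lucene implements it (the contrib analyzers ship a ShingleFilter for real use); `ShingleSketch` and `shingles` are hypothetical names, and whitespace tokenization with lowercasing is an assumption made only for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ShingleSketch {

    /** Joins every run of `size` adjacent whitespace-separated tokens into one term. */
    public static List<String> shingles(String text, int size) {
        String[] tokens = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        List<String> result = new ArrayList<>();
        // Slide a window of `size` tokens over the text, one token at a time.
        for (int i = 0; i + size <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + size)));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [grand hotel, hotel paris]
        System.out.println(shingles("Grand Hotel Paris", 2));
    }
}
```

The very frequent term "hotel" no longer stands alone: it only appears inside rarer two-word terms such as "grand hotel", so each term matches far fewer documents, at the cost of more distinct terms in the index.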

I'm sure there are other solutions too, but to be quite honest I think your corpus is pushing the limit of when it makes sense to use an II. My guess is that you can get it perhaps 2x faster than a RAMDirectory if you tune it right.




        karl

On 27 Aug 2010, at 05:34, Li Li wrote:

If I index only 7k documents, the time comparison is:
time1: 7602331019 time2: 4246878035 total1: 10736 total2: 7393
It seems II is faster than RAMDirectory.

My indexed texts are all hotel names (Chinese and English, a little French).

The index has about 100k terms. Terms such as "hotel" are very frequent,
while the words in a hotel's name are very rare (except for hotel chains).
So I guess the distribution is that a few terms are very frequent and
the other terms are very rare.

2010/8/27 Karl Wettin <karl.wet...@gmail.com>:

My mail client died while sending this mail. Sorry for any duplicate.

It is strange that it should take 20 seconds to gather fields; this is the only thing that really surprises me. I'd expect it to be instant compared to RAMDirectory. It is hard to say from the information you provided. Did you perhaps lazy load field values from your RAMDirectory and not retrieve them, or something like that?

Why your queries are slow is also hard to say; there can be many reasons. 70k documents can be quite a few documents for II if they contain enough text.

Here are a few questions that may or may not be helpful:

What is the content of the documents? Do they contain a lot of the same text, or are they all rather unique? The major thing that makes II faster than RAMDirectory is that it does not have to deserialize values from the byte stream. As the index grows, binary searching for documents containing a given term will start to consume more time than deserializing the index.

What speed do you see if you only load 10% (7k)?

Did you see the graphics in the package level javadocs?
http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/store/instantiated/package-summary.html


       karl


On 26 Aug 2010, at 09:24, Li Li wrote:

I have about 70k documents; the total indexed size is about 15MB (the
original text files' size).
       // Build the index in memory
       dir = new RAMDirectory();
       IndexWriter writer = new IndexWriter(dir, ...);
       for (loop) {
            writer.addDocument(doc);
       }
       writer.optimize();
       writer.close();

       // Open the same index through both readers
       IndexReader ir = IndexReader.open(dir, true);
       InstantiatedIndex ii = new InstantiatedIndex(ir);
       InstantiatedIndexReader iir = new InstantiatedIndexReader(ii);
       is = new IndexSearcher(ir);    // searches the RAMDirectory
       is2 = new IndexSearcher(iir);  // searches the InstantiatedIndex

            I calculate the time by:
       long searchStart=System.nanoTime();
       TopDocs docs=is.search(bQuery,Integer.MAX_VALUE);
       long searchEnd=System.nanoTime();

I searched 10,000 documents. The search times for RAMDirectory and
instantiated are:
       time1: 21s (21812978000 ns) time2: 20s (20713817000 ns)
I also calculated the time including getting the field values:
       total1: 23852ms total2: 22610ms
It seems instantiated is not much faster than RAMDirectory. Is there
anything wrong with how I used it? My max memory is 4GB.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


