Re: multithreading in SegmentsReader

Dmitry Serebrennikov Thu, 11 Oct 2001 10:10:13 -0700

Doug Cutting wrote:

>>From: Dmitry Serebrennikov [mailto:[EMAIL PROTECTED]]
>>
>>Yes, that sounds fine. Delete can definetely just be a synchronized 
>>method. And so can the numDocs unless it is called a lot. Is 
>>it? If it 
>>is, we may want to leave the upfront check in there before it is 
>>synchronized.
>>
>
>It should not be called in inner loops.  Synchronized methods don't add much
>overhead.  The main problem is if they are slow then they can become a
>bottleneck for multi-threading.  This method is only slow the first time it
>is called, when it actually needs to lock out other threads, so I don't
>think making it synchronized is a problem.
>
Ok. I guess VMs have improved and now the "conventional wisdom" has 
changed. It used to be that synchronization by itself was a performance 
hit, regardless of how it effected other threads. It had something to do 
with how it was implemented in the VM. But this information is at least 
a couple of years old.

>
>
>>>Thanks for spotting this.
>>>
>>No problem :). I'm learning a lot by taking the code apart 
>>and figuring out what goes where.
>>
>
>I think that you are giving Lucene one of the closest readings that it has
>had!
>
>>I was just 
>>going through the SegmentsTermEnum and noticed that in the 
>>next() method 
>>it does a queue.pop() followed by queue.put(), instead of the 
>>queue.adjustTop() which the comment says is at least two 
>>times faster. I 
>>guess the adjustTop was added after this code was written,
>>
>
>I think you're right.  I added adjustTop() when writing the phrase scorers,
>then realized that I couldn't use it there, and never looked for other
>places to use it.
>
>With this optimization that loop becomes:
>    
>    while (top != null && term.compareTo(top.term) == 0) {
>      docFreq += top.termEnum.docFreq();          // increment freq
>      if (top.next())
>        queue.adjustTop();                        // restore queue
>      else {
>        queue.pop();
>        top.close();                              // done with a segment
>      }
>      top = (SegmentMergeInfo)queue.top();
>    }
>
>I tested it and it seems to work, but I couldn't notice any speed
>difference.  I will wait until after the 1.2 final release before submitting
>this change.
>
Fair enough.

>
>
>This could also potentially be used in SegmentMerger, but the changes would
>be a lot more complex.  SegmentMerger could in fact be written in terms of
>SegmentsReader, but it would probably slow index merging down, since the
>TermInfo would be read twice for each term, once by SegmentsTermEnum and
>once by SegmentsTermDocs.  My inclination is to leave SegmentMerger alone.
>The priority queue access is probably not significant there anyway.
>
I had not looked at SegmentMerger yet, I think that's because I'm more 
concerned with query speed and that one I think is used at index 
creation / optimization time. I should get there soon.

But I was looking again at the MultiSearcher after reading through the 
SegmentsReader (and friends) and I was thinking if it wouldn't be better 
to write MultiSearcher not in terms of searching over multiple 
Searchers, but as an IndexReader that merges segments from more than one 
directory. A lot of the issues that MultiSearcher has to solve are also 
solved in the SegmentsReader, but slightly differently. Also, 
MultiSearcher has to re-implement the methods of Searcher (like the low 
level search API that was added recently). Another benefit would be that 
one could get a TermEnum that spans multiple indexes. This logic already 
exists in SegmentsReader but the way things stand with MultiSearcher it 
would need to reimplement it right now. This certainly is a long term 
change, but do you think it would fly? I guess the API could be:

in class IndexReader {
    public static IndexReader open(Directory dirs[]);
}

>
>
>Doug
>
Re: multithreading in SegmentsReader

Reply via email to