[jira] Updated: (LUCENE-505) MultiReader.norm() takes up too much memory: norms byte[] should be made into an Object

Doug Cutting (JIRA) Wed, 01 Mar 2006 14:25:47 -0800

     [ http://issues.apache.org/jira/browse/LUCENE-505?page=all ]


Doug Cutting updated LUCENE-505:
--------------------------------

    Version: 2.0
                 (was: 1.9)

I don't see how the memory requirements of MultiReader are twice that of 
SegmentReader.  MultiReader does not call norms(String) on each sub-reader, but 
rather norms(String, byte[], int), storing them in a previously allocated 
array, so the sub-reader normally never constructs an array for its norms.
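
A minimal sketch of the copy-into-a-shared-array pattern described above. The 
class names (SketchSegment, SketchMultiReader, NormsCopyDemo) are hypothetical 
stand-ins, not Lucene's classes, and the field-name argument of 
norms(String, byte[], int) is left out for brevity.

```java
import java.util.Arrays;

// Hypothetical stand-in for a segment-level reader; not Lucene's SegmentReader.
class SketchSegment {
    // Stands in for norm data that a real reader would read from the index.
    private final byte[] storedNorms;

    SketchSegment(byte[] storedNorms) {
        this.storedNorms = storedNorms;
    }

    int maxDoc() {
        return storedNorms.length;
    }

    // Analogue of norms(String, byte[], int): fill the caller's array at the
    // given offset instead of allocating and returning a new array.
    void norms(byte[] dest, int offset) {
        System.arraycopy(storedNorms, 0, dest, offset, storedNorms.length);
    }
}

// Hypothetical stand-in for a reader spanning several segments.
class SketchMultiReader {
    private final SketchSegment[] segments;
    private final int[] starts; // first document number of each segment
    private final int maxDoc;

    SketchMultiReader(SketchSegment... segments) {
        this.segments = segments;
        this.starts = new int[segments.length];
        int total = 0;
        for (int i = 0; i < segments.length; i++) {
            starts[i] = total;
            total += segments[i].maxDoc();
        }
        this.maxDoc = total;
    }

    // Allocate one array for the whole reader and let each segment fill its slice.
    byte[] norms() {
        byte[] all = new byte[maxDoc];
        for (int i = 0; i < segments.length; i++) {
            segments[i].norms(all, starts[i]);
        }
        return all;
    }
}

class NormsCopyDemo {
    public static void main(String[] args) {
        SketchMultiReader multi = new SketchMultiReader(
                new SketchSegment(new byte[] {1, 2, 3}),
                new SketchSegment(new byte[] {4, 5}));
        System.out.println(Arrays.toString(multi.norms())); // prints [1, 2, 3, 4, 5]
    }
}
```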

I also worry about performance with this change.  Have you benchmarked this 
while searching large indexes?  For example, in TermScorer.score(HitCollector, 
int), Lucene's innermost loop, you change two array accesses into a call to an 
interface.  That could make a substantial difference.  Small changes to that 
method can cause significant performance changes.
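
To make the concern concrete, here is a rough sketch of the two loop shapes 
being compared.  It is not TermScorer's actual code: NormLookup, 
InnerLoopSketch and the toy decode table are assumptions made for 
illustration, and the sketch only shows that both forms compute the same 
value, not how fast either one is.

```java
// Hypothetical accessor standing in for the proposed NormFactors object.
interface NormLookup {
    float factor(int doc);
}

class InnerLoopSketch {
    // Array form: one access into the norms array, one into the decode table.
    static float sumWithArrays(int[] docs, float[] raw, byte[] norms, float[] decodeTable) {
        float sum = 0f;
        for (int i = 0; i < docs.length; i++) {
            sum += raw[i] * decodeTable[norms[docs[i]] & 0xFF];
        }
        return sum;
    }

    // Object form: the per-document lookup goes through an interface call.
    static float sumWithLookup(int[] docs, float[] raw, NormLookup lookup) {
        float sum = 0f;
        for (int i = 0; i < docs.length; i++) {
            sum += raw[i] * lookup.factor(docs[i]);
        }
        return sum;
    }

    // Toy decode table for the sketch; this is not Lucene's norm encoding.
    static float[] toyDecodeTable() {
        float[] table = new float[256];
        for (int i = 0; i < 256; i++) {
            table[i] = i / 255f;
        }
        return table;
    }

    public static void main(String[] args) {
        final float[] table = toyDecodeTable();
        final byte[] norms = {0, 51, (byte) 255, 102};
        int[] docs = {0, 2, 3};
        float[] raw = {1.5f, 2.0f, 0.5f};
        System.out.println(sumWithArrays(docs, raw, norms, table));
        System.out.println(sumWithLookup(docs, raw, doc -> table[norms[doc] & 0xFF]));
    }
}
```

Whether the interface call costs anything measurable depends on the JVM and 
on how many implementations are live at the call site, which is why a 
benchmark on a large index is the relevant evidence.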

The biggest advantage of this to my eye is the removal of fakeNorms, but I 
think those are only rarely used, and even those uses can be eliminated.  One 
can now omit norms when indexing, and if such a field is searched with a 
normal query, then fakeNorms will be used.  But a ConstantScoringQuery of the 
field should return the same results, and faster too!  So the bug to fix is 
that a query run against a field with omitted norms is not automatically 
rewritten as a ConstantScoringQuery, which would be both faster and avoid 
allocating fakeNorms.

Finally, a note for other committers: we should try not to deprecate anything 
in Lucene until we finish removing all of the methods that were deprecated in 
1.9, to minimize confusion.  Ideally we can avoid having anything deprecated 
until after 2.0 is out the door.

> MultiReader.norm() takes up too much memory: norms byte[] should be made into 
> an Object
> ---------------------------------------------------------------------------------------
>
>          Key: LUCENE-505
>          URL: http://issues.apache.org/jira/browse/LUCENE-505
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Index
>     Versions: 2.0
>  Environment: Patch is against Lucene 1.9 trunk (as of Mar 1 06)
>     Reporter: Steven Tamm
>  Attachments: NormFactors.patch, NormFactors.patch
>
> MultiReader.norms() is very inefficient: it has to construct a byte array 
> with one entry for every document in every segment.  This doubles the 
> memory requirement for scoring MultiReaders vs. SegmentReaders.  Although 
> this is cached, it's still a baseline of memory that is unnecessary.
> The problem is that the Normalization Factors are passed around as a byte[].  
> If it were instead replaced with an Object, you could perform a whole host of 
> optimizations:
> a.  When reading, you wouldn't have to construct a "fakeNorms" array of all 
> 1.0fs.  You could instead return a singleton object that would just return 
> 1.0f.
> b.  MultiReader could use an object that could delegate to NormFactors of the 
> subreaders
> c.  You could write an implementation that could use mmap to access the norm 
> factors.  Or if the index isn't long lived, you could use an implementation 
> that reads directly from the disk.
> The patch provided here replaces the use of byte[] with a new abstract class 
> called NormFactors.  
> NormFactors has two methods on it:
>     public abstract byte getByte(int doc) throws IOException;  // Returns byte[doc]
>     public float getFactor(int doc) throws IOException;        // Calls Similarity.decodeNorm(getByte(doc))
> There are four implementations of this abstract class
> 1.  NormFactors.EmptyNormFactors - This replaces the fakeNorms with a 
> singleton that only returns 1.0
> 2.  NormFactors.ByteNormFactors - Converts a byte[] to a NormFactors for 
> backwards compatibility in constructors.
> 3.  MultiNormFactors - Multiplexes the NormFactors in MultiReader to prevent 
> the need to construct the gigantic norms array.
> 4.  SegmentReader.Norm - Same class, but now extends NormFactors to provide 
> the same access.
> In addition, many of the Query and Scorer classes were changed to pass around 
> NormFactors instead of byte[], and to call getFactor() instead of using the 
> byte[].  I have kept around IndexReader.norms(String) for backwards 
> compatibility, but marked it as deprecated.  I believe that the use of 
> ByteNormFactors in IndexReader.getNormFactors() will keep backward 
> compatibility with other IndexReader implementations, but I don't know how to 
> test that.
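
For readers without the patch at hand, the class layout described in the 
quoted text might look roughly like the sketch below.  This is not the code in 
NormFactors.patch: the class names carry a "Sketch" marker, ToyNormCodec 
stands in for Similarity.decodeNorm with a made-up encoding, the 
"throws IOException" clauses are dropped, and SegmentReader.Norm is omitted.

```java
// Toy codec standing in for Similarity.decodeNorm; not Lucene's encoding.
class ToyNormCodec {
    static final byte ONE = (byte) 128; // toy byte that decodes to 1.0f

    static float decode(byte b) {
        return (b & 0xFF) / 128f;
    }
}

// Sketch of the abstract accessor: one abstract method, one derived method.
abstract class SketchNormFactors {
    abstract byte getByte(int doc);

    float getFactor(int doc) {
        return ToyNormCodec.decode(getByte(doc));
    }
}

// Replaces an all-1.0 "fake norms" array with a single shared object.
class EmptySketchNormFactors extends SketchNormFactors {
    static final EmptySketchNormFactors INSTANCE = new EmptySketchNormFactors();

    private EmptySketchNormFactors() {}

    @Override byte getByte(int doc) { return ToyNormCodec.ONE; }

    @Override float getFactor(int doc) { return 1.0f; }
}

// Wraps an existing byte[] so array-based callers can still be served.
class ByteSketchNormFactors extends SketchNormFactors {
    private final byte[] norms;

    ByteSketchNormFactors(byte[] norms) { this.norms = norms; }

    @Override byte getByte(int doc) { return norms[doc]; }
}

// Delegates each lookup to the sub-reader that owns the document,
// so no combined array is ever allocated.
class MultiSketchNormFactors extends SketchNormFactors {
    private final SketchNormFactors[] subs;
    private final int[] starts; // starts[i] = first doc of sub i; last entry = total docs

    MultiSketchNormFactors(SketchNormFactors[] subs, int[] sizes) {
        this.subs = subs;
        this.starts = new int[subs.length + 1];
        for (int i = 0; i < subs.length; i++) {
            starts[i + 1] = starts[i] + sizes[i];
        }
    }

    // Linear scan for the owning sub-reader; a real version might binary-search.
    private int subIndex(int doc) {
        int i = 0;
        while (doc >= starts[i + 1]) {
            i++;
        }
        return i;
    }

    @Override byte getByte(int doc) {
        int i = subIndex(doc);
        return subs[i].getByte(doc - starts[i]);
    }

    @Override float getFactor(int doc) {
        int i = subIndex(doc);
        return subs[i].getFactor(doc - starts[i]);
    }
}

class NormFactorsSketchDemo {
    public static void main(String[] args) {
        SketchNormFactors multi = new MultiSketchNormFactors(
                new SketchNormFactors[] {
                    new ByteSketchNormFactors(new byte[] {0, 64}),
                    EmptySketchNormFactors.INSTANCE },
                new int[] {2, 3});
        System.out.println(multi.getFactor(1)); // doc 1 lives in the byte[]-backed sub
        System.out.println(multi.getFactor(3)); // doc 3 lives in the empty sub
    }
}
```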

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
