[ https://issues.apache.org/jira/browse/LUCENE-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664272#action_12664272 ]
Uwe Schindler commented on LUCENE-505:
--------------------------------------

In my opinion, the bigger problem with large indexes is that each SegmentReader keeps a cache of the most recently used norms. If you have many fields with norms enabled, this cache grows and is never freed. The cache should instead be an LRU cache, a WeakHashMap, or something similar. You can see the problem if you create an index with many norms-enabled fields (I tested with about 4,000) and many documents (half a million). If you then run CheckIndex, which calls norms() for each (!) field in the segment, every one of these calls creates a new cache entry, and you get an OutOfMemoryError after a short time (with the above index I was not able to complete CheckIndex even with -Xmx16g on 64-bit Java). A minimal sketch of such a bounded cache follows after the quoted issue text below.

> MultiReader.norm() takes up too much memory: norms byte[] should be made into
> an Object
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-505
>                 URL: https://issues.apache.org/jira/browse/LUCENE-505
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.0.0
>        Environment: Patch is against Lucene 1.9 trunk (as of Mar 1 06)
>            Reporter: Steven Tamm
>            Priority: Minor
>        Attachments: LazyNorms.patch, NormFactors.patch, NormFactors.patch,
> NormFactors20.patch
>
>
> MultiReader.norms() is very inefficient: it has to construct a byte array
> that is as long as all the documents in every segment. This doubles the
> memory requirement for scoring with MultiReaders vs. SegmentReaders.
> Although this array is cached, it is still a baseline of memory that is
> unnecessary.
> The problem is that the normalization factors are passed around as a byte[].
> If they were instead wrapped in an object, you could perform a whole host of
> optimizations:
> a. When reading, you wouldn't have to construct a "fakeNorms" array of all
> 1.0fs. You could instead return a singleton object that just returns 1.0f.
> b. MultiReader could use an object that delegates to the NormFactors of its
> subreaders.
> c. You could write an implementation that uses mmap to access the norm
> factors, or, if the index isn't long-lived, an implementation that reads
> directly from disk.
> The patch provided here replaces the use of byte[] with a new abstract class
> called NormFactors.
> NormFactors has two methods:
> public abstract byte getByte(int doc) throws IOException; // returns the
> norm byte for doc
> public float getFactor(int doc) throws IOException; // calls
> Similarity.decodeNorm(getByte(doc))
> There are four implementations of this abstract class:
> 1. NormFactors.EmptyNormFactors - replaces fakeNorms with a singleton that
> only returns 1.0f.
> 2. NormFactors.ByteNormFactors - wraps a byte[] in a NormFactors for
> backwards compatibility in constructors.
> 3. MultiNormFactors - multiplexes the NormFactors of the subreaders in
> MultiReader to avoid constructing the gigantic norms array.
> 4. SegmentReader.Norm - the same class as before, but it now extends
> NormFactors to provide the same access.
> In addition, many of the Query and Scorer classes were changed to pass
> around NormFactors instead of byte[] and to call getFactor() instead of
> indexing into the byte[]. I have kept IndexReader.norms(String) for
> backwards compatibility, but marked it as deprecated. I believe that the use
> of ByteNormFactors in IndexReader.getNormFactors() will keep backwards
> compatibility with other IndexReader implementations, but I don't know how
> to test that.
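Here is the minimal sketch of the bounded cache suggested in my comment, using a plain java.util.LinkedHashMap in access order. The class name LruNormsCache and the capacity are illustrative assumptions, not code from any attached patch (generics are used for brevity, although trunk at the time still targeted Java 1.4):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative only: a field-name -> norms byte[] cache that evicts
    // the least recently used entry once maxEntries is exceeded, so
    // reading norms for thousands of fields no longer pins all of them
    // in memory.
    public class LruNormsCache extends LinkedHashMap<String, byte[]> {

        private final int maxEntries;

        public LruNormsCache(int maxEntries) {
            // accessOrder=true makes iteration order run from least to
            // most recently accessed, which is what LRU eviction needs
            super(16, 0.75f, true);
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
            // called by put(); returning true drops the eldest
            // (= least recently used) entry
            return size() > maxEntries;
        }
    }

With something like this in place, SegmentReader would consult the cache first and let cold entries fall out instead of accumulating one entry per field; a WeakHashMap keyed on the field name would trade the deterministic bound for GC-driven cleanup.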
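For anyone reading the archive without the patch handy, here is a rough reconstruction of the NormFactors abstraction from the issue description above. The attached NormFactors.patch is the authoritative version; in particular, the MultiNormFactors offset handling below is my guess at the intent, not the patch's actual code:

    import java.io.IOException;
    import org.apache.lucene.search.Similarity;

    public abstract class NormFactors {

        // The only abstract method: the encoded norm byte for a document.
        public abstract byte getByte(int doc) throws IOException;

        // Decodes the byte into the float norm factor.
        public float getFactor(int doc) throws IOException {
            return Similarity.decodeNorm(getByte(doc));
        }

        // Replaces the all-1.0f "fakeNorms" array with a singleton.
        public static final NormFactors EMPTY = new NormFactors() {
            public byte getByte(int doc) {
                return Similarity.encodeNorm(1.0f);
            }
        };

        // Wraps an existing byte[] for backwards compatibility.
        public static class ByteNormFactors extends NormFactors {
            private final byte[] norms;
            public ByteNormFactors(byte[] norms) { this.norms = norms; }
            public byte getByte(int doc) { return norms[doc]; }
        }

        // Delegates to per-segment NormFactors instead of materializing
        // one array across all subreaders: find the subreader that holds
        // doc and rebase the doc id into that subreader's doc space.
        public static class MultiNormFactors extends NormFactors {
            private final NormFactors[] subs;
            private final int[] starts; // starts[i] = first doc id of subs[i]

            public MultiNormFactors(NormFactors[] subs, int[] starts) {
                this.subs = subs;
                this.starts = starts;
            }

            public byte getByte(int doc) throws IOException {
                int i = subs.length - 1;
                while (i > 0 && doc < starts[i]) i--; // linear scan for clarity
                return subs[i].getByte(doc - starts[i]);
            }
        }
    }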
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org