[ https://issues.apache.org/jira/browse/LUCENE-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664290#action_12664290 ]
thetaphi edited comment on LUCENE-505 at 1/15/09 2:19 PM:
---------------------------------------------------------------

{quote}
bq. In my opinion the problem with large indexes is more, that each SegmentReader has a cache of the last used norms.

I believe when MultiReader.norms is called (as Doug & Yonik said above), the underlying SegmentReaders do not in fact cache the norms (this is not readily obvious until you scrutinize the code). Ie, it's only MultiReader that caches the full array.
{quote}

In my opinion, this is not correct. I did not use a MultiReader: CheckIndex opens and then tests each segment with a separate SegmentReader. The big index with the OutOfMemory problem was optimized, so it consisted of one segment with about half a million docs and about 4,000 fields. Norms take one byte per document, so each byte[] array is about half a MiB for this index. The CheckIndex run read the norms for all 4,000 fields and the SegmentReader cached every array: 4,000 fields x 0.5 MiB is about 2 GiB of RAM, so OOMs are not unusual. The code taken from SegmentReader is here:

{code}
protected synchronized byte[] getNorms(String field) throws IOException {
  Norm norm = (Norm) norms.get(field);
  if (norm == null)
    return null;  // not indexed, or norms not stored
  synchronized (norm) {
    if (norm.bytes == null) {       // value not yet read
      byte[] bytes = new byte[maxDoc()];
      norms(field, bytes, 0);
      norm.bytes = bytes;           // cache it
      // it's OK to close the underlying IndexInput as we have cached the
      // norms and will never read them again.
      norm.close();
    }
    return norm.bytes;
  }
}
{code}

Each reader contains a Map with a Norm entry per field. The first time the norms for a specific field are read, norm.bytes == null, so the array is loaded and cached inside that Norm object, and it is never freed. In my opinion, the best fix would be a Weak- or, better, a SoftReference: norm.bytes becomes a java.lang.ref.SoftReference<byte[]> used for caching. I will prepare a patch; should I open a new issue for that? I found this problem yesterday when testing with very large indexes (you may have noticed my mail about removing norms from Trie fields).

Uwe
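For illustration, a minimal sketch of the SoftReference idea proposed above. This is not the actual patch: SoftNormsCache and readNorms are hypothetical stand-ins for SegmentReader's Norm bookkeeping and its norms(field, bytes, 0) call shown in the quoted code.

{code}
import java.io.IOException;
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** Sketch: a per-field norms cache whose arrays the GC may reclaim. */
public class SoftNormsCache {

  private final Map<String, SoftReference<byte[]>> cache =
      new HashMap<String, SoftReference<byte[]>>();

  private final int maxDoc;

  public SoftNormsCache(int maxDoc) {
    this.maxDoc = maxDoc;
  }

  /** Stand-in for SegmentReader.norms(field, bytes, 0) in the code above:
   *  fills one norm byte per document from the index. */
  protected void readNorms(String field, byte[] bytes) throws IOException {
    // ... read maxDoc bytes for this field from the norms file ...
  }

  public synchronized byte[] getNorms(String field) throws IOException {
    SoftReference<byte[]> ref = cache.get(field);
    byte[] bytes = (ref == null) ? null : ref.get();
    if (bytes == null) {            // never loaded, or reclaimed by the GC
      bytes = new byte[maxDoc];
      readNorms(field, bytes);
      cache.put(field, new SoftReference<byte[]>(bytes));
    }
    return bytes;
  }
}
{code}

The design point: unlike a WeakReference, a SoftReference is only cleared when the VM runs low on memory, so frequently used norms stay cached while a CheckIndex-style pass over 4,000 fields can no longer pin 2 GiB.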
> MultiReader.norm() takes up too much memory: norms byte[] should be made into an Object
> ----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-505
>                 URL: https://issues.apache.org/jira/browse/LUCENE-505
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.0.0
>        Environment: Patch is against Lucene 1.9 trunk (as of Mar 1 06)
>            Reporter: Steven Tamm
>            Priority: Minor
>        Attachments: LazyNorms.patch, NormFactors.patch, NormFactors.patch, NormFactors20.patch
>
>
> MultiReader.norms() is very inefficient: it has to construct a byte array that's as long as all the documents in every segment. This doubles the memory requirement for scoring MultiReaders vs. SegmentReaders. Although this array is cached, it's still a baseline of memory that is unnecessary.
> The problem is that the normalization factors are passed around as a byte[]. If they were instead wrapped in an Object, you could perform a whole host of optimizations:
> a. When reading, you wouldn't have to construct a "fakeNorms" array of all 1.0fs. You could instead return a singleton object that would just return 1.0f.
> b. MultiReader could use an object that delegates to the NormFactors of the subreaders.
> c. You could write an implementation that uses mmap to access the norm factors. Or, if the index isn't long-lived, you could use an implementation that reads directly from disk.
> The patch provided here replaces the use of byte[] with a new abstract class called NormFactors.
> NormFactors has two methods:
> public abstract byte getByte(int doc) throws IOException;  // Returns the byte[doc]
> public float getFactor(int doc) throws IOException;        // Calls Similarity.decodeNorm(getByte(doc))
> There are four implementations of this abstract class:
> 1. NormFactors.EmptyNormFactors - Replaces the fakeNorms with a singleton that only returns 1.0.
> 2. NormFactors.ByteNormFactors - Converts a byte[] to a NormFactors for backwards compatibility in constructors.
> 3. MultiNormFactors - Multiplexes the NormFactors in MultiReader to prevent the need to construct the gigantic norms array.
> 4. SegmentReader.Norm - Same class, but now extends NormFactors to provide the same access.
> In addition, many of the Query and Scorer classes were changed to pass around NormFactors instead of byte[] and to call getFactor() instead of using the byte[]. I have kept IndexReader.norms(String) around for backwards compatibility, but marked it as deprecated. I believe that the use of ByteNormFactors in IndexReader.getNormFactors() will keep backward compatibility with other IndexReader implementations, but I don't know how to test that.
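To make the four implementations listed above concrete, here is a hedged sketch of the NormFactors design reconstructed from the description, not taken from the attached patch: the MultiNormFactors constructor arguments, the starts array, and readerIndex are assumptions, while decodeNorm/encodeNorm are Lucene's static Similarity helpers.

{code}
import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.search.Similarity; // static decodeNorm/encodeNorm

/** Sketch of the NormFactors abstraction described in the issue. */
public abstract class NormFactors {

  /** Returns the raw norm byte for a document (replaces norms[doc]). */
  public abstract byte getByte(int doc) throws IOException;

  /** Decoded factor; calls Similarity.decodeNorm(getByte(doc)). */
  public float getFactor(int doc) throws IOException {
    return Similarity.decodeNorm(getByte(doc));
  }

  /** EmptyNormFactors: a singleton replacing the all-1.0f "fakeNorms" array. */
  public static final NormFactors EMPTY = new NormFactors() {
    public byte getByte(int doc) { return Similarity.encodeNorm(1.0f); }
    public float getFactor(int doc) { return 1.0f; }
  };

  /** ByteNormFactors: wraps a plain byte[] for backwards compatibility. */
  public static class ByteNormFactors extends NormFactors {
    private final byte[] norms;
    public ByteNormFactors(byte[] norms) { this.norms = norms; }
    public byte getByte(int doc) { return norms[doc]; }
  }

  /** MultiNormFactors: delegates to the sub-readers instead of building
   *  one gigantic array spanning every document of every segment. */
  public static class MultiNormFactors extends NormFactors {
    private final NormFactors[] subs; // one per sub-reader
    private final int[] starts;       // first doc id of each sub-reader

    public MultiNormFactors(NormFactors[] subs, int[] starts) {
      this.subs = subs;
      this.starts = starts;
    }

    public byte getByte(int doc) throws IOException {
      int i = readerIndex(doc);       // which sub-reader holds doc?
      return subs[i].getByte(doc - starts[i]);
    }

    private int readerIndex(int doc) {
      int i = Arrays.binarySearch(starts, doc);
      return i < 0 ? -i - 2 : i;      // insertion point - 1 on a miss
    }
  }
}
{code}

With something like this in place, MultiReader.norms(field) could hand out a MultiNormFactors over its sub-readers rather than allocating and filling a byte[] covering every document.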