RE: Possible memory leak in Lucene.NET 2.4?

Michael Garski Wed, 06 Jan 2010 11:57:57 -0800

Simone,

There is a trade-off in the use of filters - more memory consumed but
faster performance.  It's a good idea to test both approaches (filter
vs. Boolean clause) to find what works best for you.
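
To make the trade-off concrete, here is a minimal sketch of the idea behind a cached filter: one bit set per blog, computed once and re-used, with a cap on how many are kept in memory. It is plain Java using java.util.BitSet as a stand-in, not Lucene's FilterManager; the class name BlogFilterCache, its methods, and the toy loader are all hypothetical names chosen for illustration.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical bounded cache of per-blog document bit sets (not a Lucene API). */
public class BlogFilterCache {
    private final int maxEntries;
    private final IntFunction<BitSet> loader;
    private final Map<Integer, BitSet> cache;
    private int loads = 0;

    public BlogFilterCache(int maxEntries, IntFunction<BitSet> loader) {
        this.maxEntries = maxEntries;
        this.loader = loader;
        // accessOrder = true keeps entries ordered least-recently-used first.
        this.cache = new LinkedHashMap<Integer, BitSet>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, BitSet> eldest) {
                // Evict the least-recently-used bit set once the cap is exceeded.
                return size() > BlogFilterCache.this.maxEntries;
            }
        };
    }

    /** Returns the cached bit set for a blog, building it only on a miss. */
    public synchronized BitSet filterFor(int blogId) {
        BitSet bits = cache.get(blogId);
        if (bits == null) {
            bits = loader.apply(blogId);
            loads++;
            cache.put(blogId, bits);
        }
        return bits;
    }

    /** Number of times the loader had to build a bit set. */
    public synchronized int loadCount() { return loads; }

    /** Number of bit sets currently held in memory. */
    public synchronized int cachedCount() { return cache.size(); }

    public static void main(String[] args) {
        // Toy "index": position = document number, value = blog id (made-up data).
        int[] blogOfDoc = {1, 2, 1, 3, 2, 1};
        BlogFilterCache filters = new BlogFilterCache(2, blogId -> {
            BitSet bits = new BitSet(blogOfDoc.length);
            for (int doc = 0; doc < blogOfDoc.length; doc++) {
                if (blogOfDoc[doc] == blogId) bits.set(doc);
            }
            return bits;
        });
        filters.filterFor(1);
        filters.filterFor(1); // second call is served from the cache
        System.out.println("loads: " + filters.loadCount());
        System.out.println("cached: " + filters.cachedCount());
    }
}
```

Capping the number of cached entries is one way to bound the memory side of the trade-off: rarely visited blogs pay the rebuild cost again, busy ones stay cached.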


Each filter will consume 1 bit in memory for each document plus the
overhead of the object itself.  With 50K documents, each filter would
consume approximately 6.5KB of memory, with 1500 consuming approximately
10MB.  That's not a whole lot of memory, and will give you a bump in
search performance.  How much of a bump, you'll have to test to
determine.  I don't have much experience with indexes of your size;
however, for the large indexes we work with, the gain can be significant.
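
The arithmetic behind those figures can be checked with a short sketch. The 250 bytes of per-filter object overhead is an assumption, picked only so that the result lines up with the roughly 6.5KB per filter quoted above; the class and method names are hypothetical.

```java
public class FilterMemoryEstimate {
    /** Bytes needed for one bit per document, rounded up to whole bytes. */
    public static long bitSetBytes(long documentCount) {
        return (documentCount + 7) / 8;
    }

    /** Total bytes for filterCount filters, each with a fixed assumed overhead. */
    public static long totalBytes(long documentCount, int filterCount, long overheadBytesPerFilter) {
        return filterCount * (bitSetBytes(documentCount) + overheadBytesPerFilter);
    }

    public static void main(String[] args) {
        long rawBits = bitSetBytes(50_000);          // raw bits only, no object overhead
        long total = totalBytes(50_000, 1_500, 250); // 250 bytes overhead is an assumption
        System.out.println(rawBits + " bytes of bits per filter");
        System.out.println(total + " bytes for 1500 filters");
    }
}
```

With these inputs the raw bits come to 6,250 bytes per filter, and 1500 filters with the assumed overhead come to 9,750,000 bytes, which is where the "approximately 10MB" comes from.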

Michael 

-----Original Message-----
From: Simone Chiaretta [mailto:[email protected]] 
Sent: Wednesday, January 06, 2010 11:46 AM
To: [email protected]
Subject: Re: Possible memory leak in Lucene.NET 2.4?

Michael,
great suggestions, thank you.

More below

On Wed, Jan 6, 2010 at 8:29 PM, Michael Garski
<[email protected]> wrote:

> Simone -
>
> Filters will provide for more efficient queries in your case if you
> filter on the blog id rather than using it as a query clause, as the
> filter can be cached and re-used for future queries.  Be sure to use
> the FilterManager to ensure your filters are being cached and not
> re-created for each query.
>

It makes sense when I have just one blog... but can I have, let's say,
1500 different filters, one per blog?
What I need is to filter blog posts based on the blog I'm currently in:
so I have to search over blog 1 if I'm in blog 1, and over blog 1500 if
I'm in blog 1500.
Does this mean I'll have to cache 1500 different filters? Or in this
case would a simple plain query be better?



> Optimizing on startup could delay the app pool being available.  I'd
> suggest that rather than optimizing on a periodic basis you create
> a custom MergePolicy to control the number of segments in the index
> and when segments are merged.  With 2.9 I take this approach and don't
> use the Optimize call at all anymore.  Provided you don't have
> thousands of segments, a multi-segment index should not pose a
> performance issue.  There are quite a number of other performance
> improvements that can be made that have a bigger impact, such as
> filters.
>

I'll take a look at that.



>
> The 2.9 version I am using is the current trunk version.  It's very
> stable and I have not encountered any issues.
>

Ok.. great...thx


>
> Retrieving the IndexReader from the IndexWriter will give you the most
> recent changes, including those that have not yet been committed.
>
> http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/index/IndexWriter.html#getReader%28%29
>
>
OK... 2.9 seems to be solving most of my problems :)


> Unfortunately I don't have the bandwidth at this time to help you out
> with a code review, but should you have any questions, continue to
> post them to the list and I'll provide some feedback as time allows.
>

Ok.. no problem... thank you anyway.. you are being very helpful
Simo


>
> Michael
>
> -----Original Message-----
> From: Simone Chiaretta [mailto:[email protected]]
> Sent: Wednesday, January 06, 2010 11:11 AM
> To: [email protected]
> Subject: Re: Possible memory leak in Lucene.NET 2.4?
>
> Hi Michael,
>
> more below
>
> On Wed, Jan 6, 2010 at 7:41 PM, Michael Garski
> <[email protected]> wrote:
>
> > Simone,
> >
> > Filters work to constrain the query to the subset of documents that
> > are contained in the filter, which can improve performance.
>
>
> Ok, from what I see, filtering can help me filter out posts from other
> blogs.
> But can filters change with every query?
> What's the difference between:
> query for "xyz" on blog 1 over all index
> vs.
> query for "xyz" over the index filtered by blog 1?
>
>
> > The field cache
> > is used to cache values if you are sorting by something other than
> > the score, such as by date or some other value in the index.
> >
>
>
> I'm just sorting by score... so probably not needed
>
>
> >
> > Optimizing after each document incurs an unnecessary overhead as all
> > segments are merged into one, which is not necessary even in
> > versions prior to 2.9.
> >
>
> Great, thank you... I can remove this, which would help speed up the
> add-document procedure on large indexes...
> and since in a web app the pool recycles anyway every day or so, doing
> an optimize at the creation of the index writer will be enough,
> correct?
>
>
> > If your app has not yet been released, I would suggest using 2.9 and
> > ensuring you are not using any methods or properties marked with the
> > Obsolete attribute to streamline migration to future versions.
>
>
> Great... thank you again... 2.9 is the trunk, right? I don't see a tag
> for it in SVN
>
>
> > Another
> > change in 2.9 you could take advantage of is retrieving an
> > IndexReader from the IndexWriter through the GetReader method, which
> > will save you from having to have both a writer and a reader in
> > application scope.  The writer could be held at the application
> > level and the reader retrieved from it directly.
> >
>
> And that will give the most current reader updated with the latest new
> docs?
>
> One last thing:
> Would you be so kind (if you have time, and with the proper credit
> given in the source code and in the release notes) as to do a kind of
> source code review of the blog's search engine?
> Thx
>
> Simone
>
>
> >
> > Michael
> >
> >
> > -----Original Message-----
> > From: Simone Chiaretta [mailto:[email protected]]
> > Sent: Wednesday, January 06, 2010 10:28 AM
> > To: [email protected]
> > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> >
> > I'm just using queries... I'm pretty new to Lucene, so I went for
> > the easier solution.
> > Would you recommend using filters and caching instead of queries?
> >
> > At the moment I'm on Lucene 2.3.1... would you recommend moving to
> > 2.9?
> > My app has not been released yet (an open source blogging engine),
> > but will be shortly.
> > The number of documents indexed will range from 0 to 50,000 blog
> > posts (our biggest installation atm).
> >
> > Will not optimizing after every new document reduce the performance
> > of searches on such indexes?
> >
> > Simone
> >
> > On Wed, Jan 6, 2010 at 7:08 PM, Michael Garski
> > <[email protected]> wrote:
> >
> > > Simone,
> > >
> > > Are you using any field caches or filters?
> > >
> > > In versions prior to 2.9, reopening the index will completely
> > > rebuild the field cache and filter bits for all documents in the
> > > index, which can result in an increase in memory consumption.  In
> > > 2.9 and future versions, the field cache and filter bits are cached
> > > at a segment level, which results in significantly faster re-opens
> > > as only the new segments are loaded into the caches.
> > >
> > > Our applications use very large indexes and 2.9's segment level
> > > caching allows us to re-open indexes much faster while utilizing
> > > less memory in the process.
> > >
> > > Michael
> > >
> > > -----Original Message-----
> > > From: Simone Chiaretta [mailto:[email protected]]
> > > Sent: Wednesday, January 06, 2010 10:01 AM
> > > To: [email protected]
> > > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> > >
> > > > What I am doing is initializing the writer in the App_Start event
> > > > of the web app, and closing everything at the App_End event.
> > > > For the reader, I start it at the first search request, re-open it
> > > > every time a new document is added, and then close it in the
> > > > App_End event.
> > >
> > > If you are interested here is the search engine service I'm using:
> > >
> > > > http://code.google.com/p/subtext/source/browse/trunk/src/Subtext.Framework/Services/SearchEngine/SearchEngineService.cs
> > >
> > > Simone
> > >
> > > > On Wed, Jan 6, 2010 at 6:31 PM, Matt Honeycutt
> > > > <[email protected]> wrote:
> > >
> > > > > Won't the various global application events be fired if the app
> > > > > pool is gracefully terminated/recycled?  While not ideal,
> > > > > couldn't you initialize your Lucene objects during one of the
> > > > > application initialization events, then dispose of them in the
> > > > > corresponding shutdown events?
> > > >
> > > > > On Wed, Jan 6, 2010 at 11:14 AM, Michael Garski
> > > > > <[email protected]> wrote:
> > > >
> > > > > > If it's not an option to create search functionality in a
> > > > > > separate process, such as in a shared hosting environment, you
> > > > > > may be limited in the size of your index and how you query it.
> > > > > > The field cache, and to a lesser extent filters, will consume a
> > > > > > fair amount of memory that is proportional to the number of
> > > > > > documents in the index.
> > > > > >
> > > > > > As others have mentioned, you will have to ensure that
> > > > > > resources are released when the app pool recycles.
> > > > >
> > > > > Michael
> > > > >
> > > > > -----Original Message-----
> > > > > From: Simone Chiaretta [mailto:[email protected]]
> > > > > Sent: Wednesday, January 06, 2010 12:45 AM
> > > > > To: [email protected]
> > > > > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> > > > >
> > > > > > Unfortunately not everybody can use another process: I'm
> > > > > > building a blog engine that must be able to run on a shared
> > > > > > hosting provider.  The 2nd process is not an option :)
> > > > >
> > > > > Simone
> > > > >
> > > > > On Tuesday, January 5, 2010, Digy <[email protected]> wrote:
> > > > > > > As Michael stated, I also prefer not hosting "indexing and
> > > > > > > searching services" in IIS.
> > > > > > > There are many alternatives such as WCF, Remoting etc. With a
> > > > > > > separate service for Lucene, you can control anything you
> > > > > > > want.
> > > > > >
> > > > > > DIGY
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Michael Garski [mailto:[email protected]]
> > > > > > Sent: Tuesday, January 05, 2010 11:11 PM
> > > > > > To: [email protected]
> > > > > > Subject: RE: Possible memory leak in Lucene.NET 2.4?
> > > > > >
> > > > > > Jeff,
> > > > > >
> > > > > > > Correct - there is no need to optimize the index after adding
> > > > > > > a document, and I would recommend against it, especially when
> > > > > > > you move to 2.9, as you will not see any of the benefits of
> > > > > > > the changes to composite readers such as faster incremental
> > > > > > > warm-ups to filters and field caches.
> > > > > > >
> > > > > > > I've never run Lucene.Net in the context of a web process and
> > > > > > > would actually recommend against that approach due to app pool
> > > > > > > recycling, opting instead for a service that exposes search
> > > > > > > functionality via WCF.
> > > > > > >
> > > > > > > What types of queries are you executing? Are you using filters
> > > > > > > or sorting?  How often do you re-open the IndexReader that is
> > > > > > > used for searching?  Re-opening the reader after each document
> > > > > > > addition can be an expensive process, especially if you are
> > > > > > > using filters and/or sorts.  How are you refreshing the
> > > > > > > IndexReader?
> > > > > > >
> > > > > > > Regarding the IndexReader locking files, this is a feature
> > > > > > > which allows you to concurrently index and search on the same
> > > > > > > index and not have to worry about the IndexWriter deleting a
> > > > > > > segment file from underneath the searcher when a segment merge
> > > > > > > occurs.
> > > > > > >
> > > > > > > The first place to look would be to use a memory profiler to
> > > > > > > determine what is actually consuming the memory.  I use the
> > > > > > > SciTech .NET Memory Profiler for such purposes.
> > > > > >
> > > > > > Michael
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jeff Pennal [mailto:[email protected]]
> > > > > > Sent: Tuesday, January 05, 2010 12:42 PM
> > > > > > To: [email protected]
> > > > > > Subject: Possible memory leak in Lucene.NET 2.4?
> > > > > >
> > > > > > Hello all,
> > > > > >
> > > > > > > In doing some profiling of our Lucene code, I noticed that we
> > > > > > > were doing an optimize call after every update to our index.
> > > > > > > Though our index is relatively small (~75MB), the optimize
> > > > > > > task still took way too much time to run.
> > > > > > >
> > > > > > > I did some research and it seems like it would not be an issue
> > > > > > > to update our index without optimizing afterwards, the side
> > > > > > > effect being that we'd have more open file handles.
> > > > > > >
> > > > > > > I made that change and noticed some horrible performance side
> > > > > > > effects.
> > > > > > >
> > > > > > > The first thing I noticed was that the CPU for our web
> > > > > > > application (ASP.NET MVC) that reads from the index never went
> > > > > > > below 60-70% and was frequently pegged at 99%.
> > > > > > >
> > > > > > > In addition to the CPU spiking, the memory taken up by the
> > > > > > > w3wp.exe process quickly grew to around 800MB, which is about
> > > > > > > 300MB above normal.
> > > > > > >
> > > > > > > This has all the hallmarks of a memory leak somewhere.
> > > > > > >
> > > > > > > Finally, I noticed that the IndexReader was locking some of
> > > > > > > the files in the index folder even though the reader was set
> > > > > > > to nolock mode.  This seemed to be the cause of the increase
> > > > > > > in the number of files in the index folder.
> > > > > > >
> > > > > > > We have the IndexReader set to open once and then be shared
> > > > > > > among every request to the web application. My understanding
> > > > > > > is that this is the correct way to do this, and this never
> > > > > > > caused any issues when we were optimizing the index after
> > > > > > > every update.
> > > > > > >
> > > > > > > I know this is a pretty vague problem and there could be any
> > > > > > > number of issues involved here. However, if anyone could
> > > > > > > suggest areas to look into for possible solutions, it would be
> > > > > > > greatly appreciated.
> > > > > >
> > > > > > Thanks,
> > > > > > Jeff
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > Simone Chiaretta
> > > > > Microsoft MVP ASP.NET - ASPInsider
> > > > > Blog: http://codeclimber.net.nz
> > > > > RSS: http://feeds2.feedburner.com/codeclimber
> > > > > twitter: @simonech
> > > > >
> > > > > > Any sufficiently advanced technology is indistinguishable from
> > > > > > magic
> > > > > "Life is short, play hard"
> > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Simone Chiaretta
> > > Microsoft MVP ASP.NET - ASPInsider
> > > Blog: http://codeclimber.net.nz
> > > RSS: http://feeds2.feedburner.com/codeclimber
> > > twitter: @simonech
> > >
> > > Any sufficiently advanced technology is indistinguishable from
magic
> > > "Life is short, play hard"
> > >
> > >
> >
> >
> > --
> > Simone Chiaretta
> > Microsoft MVP ASP.NET - ASPInsider
> > Blog: http://codeclimber.net.nz
> > RSS: http://feeds2.feedburner.com/codeclimber
> > twitter: @simonech
> >
> > Any sufficiently advanced technology is indistinguishable from magic
> > "Life is short, play hard"
> >
> >
>
>
> --
> Simone Chiaretta
> Microsoft MVP ASP.NET - ASPInsider
> Blog: http://codeclimber.net.nz
> RSS: http://feeds2.feedburner.com/codeclimber
> twitter: @simonech
>
> Any sufficiently advanced technology is indistinguishable from magic
> "Life is short, play hard"
>
>


-- 
Simone Chiaretta
Microsoft MVP ASP.NET - ASPInsider
Blog: http://codeclimber.net.nz
RSS: http://feeds2.feedburner.com/codeclimber
twitter: @simonech

Any sufficiently advanced technology is indistinguishable from magic
"Life is short, play hard"
