Re: Possible memory leak in Lucene.NET 2.4?

Simone Chiaretta Wed, 06 Jan 2010 11:46:04 -0800

Michael,
great suggestions, thank you.

More below


On Wed, Jan 6, 2010 at 8:29 PM, Michael Garski <[email protected]> wrote:

> Simone -
>
> Filters will provide for more efficient queries in your case if you
> filter on the blog id rather than using it as a query clause as the
> filter can be cached and re-used for future queries.  Be sure to use the
> FilterManager to ensure your filters are being cached and not re-created
> for each query.
>

That makes sense when I have just one blog, but can I have, say, 1500
different filters, one per blog?
What I need is to filter blog posts by the blog I'm currently in:
I have to search within blog 1 if I'm in blog 1, and within blog 1500 if
I'm in blog 1500.
Does this mean I'll have to cache 1500 different filters? Or would a
plain query be better in this case?
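
To make the question concrete, here is a minimal sketch of what I imagine
caching one filter per blog would involve. It is plain Java and does not use
Lucene's FilterManager: BlogFilterCache, its method names, and the use of
java.util.BitSet as a stand-in for a filter's bits are all my own assumptions
for illustration. It also assumes a filter costs roughly one bit per document
in the index, and it bounds the cache with LRU eviction so that rarely visited
blogs don't hold on to memory.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical LRU cache of per-blog filter bits; not part of Lucene. */
class BlogFilterCache {
    private final int capacity;
    private final IntFunction<BitSet> loader;
    private final LinkedHashMap<Integer, BitSet> cache;
    private int loads = 0;

    BlogFilterCache(int capacity, IntFunction<BitSet> loader) {
        this.capacity = capacity;
        this.loader = loader;
        // accessOrder = true keeps entries ordered least-recently-used first.
        this.cache = new LinkedHashMap<Integer, BitSet>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, BitSet> eldest) {
                return size() > BlogFilterCache.this.capacity;
            }
        };
    }

    /** Returns the cached bits for a blog, building them on a cache miss. */
    synchronized BitSet get(int blogId) {
        BitSet bits = cache.get(blogId);
        if (bits == null) {
            bits = loader.apply(blogId);
            loads++;
            cache.put(blogId, bits);
        }
        return bits;
    }

    synchronized int size() { return cache.size(); }
    synchronized int loads() { return loads; }
    synchronized boolean isCached(int blogId) { return cache.containsKey(blogId); }

    /** Rough memory estimate, ASSUMING one bit per document per filter. */
    static long estimateBytes(int filterCount, int maxDoc) {
        return (long) filterCount * ((maxDoc + 7) / 8);
    }
}

class BlogFilterCacheDemo {
    public static void main(String[] args) {
        int maxDoc = 50_000;
        // Stand-in loader: pretend every 1500th document belongs to the blog.
        BlogFilterCache cache = new BlogFilterCache(100, blogId -> {
            BitSet bits = new BitSet(maxDoc);
            for (int doc = blogId % 1500; doc < maxDoc; doc += 1500) bits.set(doc);
            return bits;
        });
        cache.get(1);
        cache.get(1);    // second call is served from the cache
        cache.get(1500);
        System.out.println("filters built: " + cache.loads());
        System.out.println("estimate for 1500 filters: "
            + BlogFilterCache.estimateBytes(1500, maxDoc) + " bytes");
    }
}
```

Under that one-bit-per-document assumption, 1500 filters over a
50,000-document index would come to roughly 9 MB in total. I don't know how
close that is to what Lucene actually allocates.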



> Optimizing on startup could delay the app pool being available.  I'd
> suggest that rather than optimizing on a periodic basis that you create
> a custom MergePolicy to control the number of segments in the index and
> when segments are merged.  With 2.9 I take this approach and don't use
> the Optimize call at all anymore.  Provided you don't have thousands of
> segments, a multi-segment index should not pose a performance issue.
> There are quite a number of other performance improvements that can be
> made that have a bigger impact, such as filters.
>

I'll take a look at that.
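
As a first step, here is a toy simulation of how I understand a level-based
merge policy keeps the number of segments bounded without ever calling
Optimize. It is not Lucene's merge policy code or API: MergeSimulation,
segmentsAfter, and the rule "merge whenever mergeFactor segments share a
level" are simplifying assumptions of mine.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of level-based segment merging; not Lucene's MergePolicy API. */
class MergeSimulation {
    /**
     * Simulates `flushes` newly flushed segments and returns how many
     * segments remain, merging whenever `mergeFactor` segments share a level.
     */
    static int segmentsAfter(int flushes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        // perLevel.get(i) = number of segments currently at level i
        List<Integer> perLevel = new ArrayList<>();
        for (int f = 0; f < flushes; f++) {
            int level = 0;
            while (true) {
                if (perLevel.size() == level) perLevel.add(0);
                int count = perLevel.get(level) + 1;
                if (count < mergeFactor) {
                    perLevel.set(level, count);
                    break;
                }
                // mergeFactor segments at this level: merge them into
                // a single segment at the next level up.
                perLevel.set(level, 0);
                level++;
            }
        }
        int total = 0;
        for (int c : perLevel) total += c;
        return total;
    }

    public static void main(String[] args) {
        for (int flushes : new int[] {9, 10, 999, 1000}) {
            System.out.println(flushes + " flushes -> "
                + segmentsAfter(flushes, 10) + " segments");
        }
    }
}
```

In this model there are never more than mergeFactor - 1 segments per level,
so the segment count grows only logarithmically with the number of flushes.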



>
> The 2.9 version I am using is the current trunk version.  It's very
> stable and I have not encountered any issues.
>

OK, great, thanks.


>
> Retrieving the IndexReader from the IndexWriter will give you the most
> recent changes, including those that have not yet been committed.
> http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/index/IndexWriter.html#getReader%28%29
>
>
OK... 2.9 seems to be solving most of my problems :)


> Unfortunately I don't have the bandwidth at this time to help you out
> with a code review, but should you have any questions, continue to post
> them to the list and I'll provide some feedback as time allows.
>

OK, no problem. Thank you anyway, you are being very helpful.
Simo


>
> Michael
>
> -----Original Message-----
> From: Simone Chiaretta [mailto:[email protected]]
> Sent: Wednesday, January 06, 2010 11:11 AM
> To: [email protected]
> Subject: Re: Possible memory leak in Lucene.NET 2.4?
>
> Hi Michael,
>
> more below
>
> On Wed, Jan 6, 2010 at 7:41 PM, Michael Garski <[email protected]> wrote:
>
> > Simone,
> >
> > Filters work to constrain the query to the subset of documents that
> > are contained in the filter, which can improve performance.
>
>
> OK, from what I see, filtering can help me filter out posts from other
> blogs.
> But can filters change with every query?
> What's the difference between:
> a query for "xyz" on blog 1 over the whole index
> vs.
> a query for "xyz" over the index filtered by blog 1?
>
>
> > The field cache
> > is used to cache values if you are sorting by something other than the
> > score, such as by date or some other value in the index.
> >
>
>
> I'm just sorting by score... so probably not needed
>
>
> >
> > Optimizing after each document incurs an unnecessary overhead as all
> > segments are merged into one, which is not necessary even in versions
> > prior to 2.9.
> >
>
> Great, thank you... I can remove this; it would help speed up the
> add-document procedure on large indexes.
> And since in a web app the pool recycles every day or so anyway, doing
> an optimize when the index writer is created will be enough, correct?
>
>
> > If your app has not yet been released, I would suggest using 2.9 and
> > ensuring you are not using any methods or properties marked with the
> > Obsolete attribute to streamline migration to future versions.
>
>
> Great... thank you again... 2.9 is the trunk, right? I don't see a tag
> for it in SVN.
>
>
> > Another
> > change in 2.9 you could take advantage of is retrieving an IndexReader
> > from the IndexWriter through the GetReader method, which will save you
> > from having to have both a writer and a reader in application scope.
> > The writer could be held at the application level and the reader
> > retrieved from it directly.
> >
>
> And that will give me the most current reader, updated with the latest
> new docs?
>
> One last thing:
> Would you be so kind (if you have time, and with proper credit given in
> the source code and in the release notes) as to do a source code review
> of the blog's search engine?
> Thx
>
> Simone
>
>
> >
> > Michael
> >
> >
> > -----Original Message-----
> > From: Simone Chiaretta [mailto:[email protected]]
> > Sent: Wednesday, January 06, 2010 10:28 AM
> > To: [email protected]
> > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> >
> > I'm just using queries... I'm pretty new to Lucene, so I went for the
> > easier solution.
> > Would you recommend using filters and caching instead of queries?
> >
> > At the moment I'm on Lucene 2.3.1... would you recommend moving to
> > 2.9?
> > My app has not been released yet (an open source blogging engine), but
> > will be shortly.
> > The number of documents indexed will range from 0 to 50,000 blog posts
> > (our biggest installation at the moment).
> >
> > Won't skipping the optimize after every new document reduce search
> > performance on such indexes?
> >
> > Simone
> >
> > On Wed, Jan 6, 2010 at 7:08 PM, Michael Garski <[email protected]> wrote:
> >
> > > Simone,
> > >
> > > Are you using any field caches or filters?
> > >
> > > In versions prior to 2.9, reopening the index will completely
> > > rebuild the field cache and filter bits for all documents in the
> > > index, which can result in an increase in memory consumption.  In
> > > 2.9 and future versions, the field cache and filter bits are cached
> > > at a segment level, which results in significantly faster re-opens
> > > as only the new segments are loaded into the caches.
> > >
> > > Our applications use very large indexes and 2.9's segment level
> > > caching allows us to re-open indexes much faster while utilizing
> > > less memory in the process.
> > >
> > > Michael
> > >
> > > -----Original Message-----
> > > From: Simone Chiaretta [mailto:[email protected]]
> > > Sent: Wednesday, January 06, 2010 10:01 AM
> > > To: [email protected]
> > > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> > >
> > > What I am doing is initializing the writer in the App_Start event
> > > of the web app, and closing everything in the App_End event.
> > > For the reader, I open it at the first search request, re-open it
> > > every time a new document is added, and then close it in App_End.
> > >
> > > If you are interested, here is the search engine service I'm using:
> > > http://code.google.com/p/subtext/source/browse/trunk/src/Subtext.Framework/Services/SearchEngine/SearchEngineService.cs
> > >
> > > Simone
> > >
> > > On Wed, Jan 6, 2010 at 6:31 PM, Matt Honeycutt <[email protected]> wrote:
> > > >
> > > > Won't the various global application events be fired if the app
> > > > pool is gracefully terminated/recycled?  While not ideal, couldn't
> > > > you initialize your Lucene objects during one of the application
> > > > initialization events, then dispose of them in the corresponding
> > > > shutdown events?
> > > >
> > > > On Wed, Jan 6, 2010 at 11:14 AM, Michael Garski <[email protected]> wrote:
> > > > >
> > > > > If it's not an option to create search functionality in a
> > > > > separate process, such as in a shared hosting environment, you
> > > > > may be limited in the size of your index and how you query it.
> > > > > The field cache, and to a lesser extent filters, will consume a
> > > > > fair amount of memory that is proportional to the number of
> > > > > documents in the index.
> > > > >
> > > > > As others have mentioned, you will have to ensure that resources
> > > > > are released when the app pool recycles.
> > > > >
> > > > > Michael
> > > > >
> > > > > -----Original Message-----
> > > > > From: Simone Chiaretta [mailto:[email protected]]
> > > > > Sent: Wednesday, January 06, 2010 12:45 AM
> > > > > To: [email protected]
> > > > > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> > > > >
> > > > > Unfortunately not everybody can use another process: I'm
> > > > > building a blog engine that must be able to run on shared
> > > > > hosting providers. A second process is not an option :)
> > > > >
> > > > > Simone
> > > > >
> > > > > On Tuesday, January 5, 2010, Digy <[email protected]> wrote:
> > > > > > As Michael stated, I also prefer not hosting "indexing and
> > > > > > searching services" in IIS.
> > > > > > There are many alternatives such as WCF, Remoting etc. With a
> > > > > > separate service for Lucene, you can control anything you want.
> > > > > >
> > > > > > DIGY
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Michael Garski [mailto:[email protected]]
> > > > > > Sent: Tuesday, January 05, 2010 11:11 PM
> > > > > > To: [email protected]
> > > > > > Subject: RE: Possible memory leak in Lucene.NET 2.4?
> > > > > >
> > > > > > Jeff,
> > > > > >
> > > > > > Correct - there is no need to optimize the index after adding a
> > > > > > document, and I would recommend against it especially when you
> > > > > > move to 2.9 as you will not see any of the benefits of the
> > > > > > changes to composite readers such as faster incremental warm-ups
> > > > > > to filters and field caches.
> > > > > >
> > > > > > I've never run Lucene.Net in the context of a web process and
> > > > > > would actually recommend against that approach due to app pool
> > > > > > recycling, opting for a service that exposed search
> > > > > > functionality via WCF.
> > > > > >
> > > > > > What types of queries are you executing? Are you using filters
> > > > > > or sorting?  How often do you re-open the IndexReader that is
> > > > > > used for searching?  Re-opening the reader after each document
> > > > > > addition can be an expensive process, especially if you are
> > > > > > using filters and/or sorts.  How are you refreshing the
> > > > > > IndexReader?
> > > > > >
> > > > > > Regarding the IndexReader locking files, this is a feature which
> > > > > > allows you to concurrently index and search on the same index
> > > > > > and not have to worry about the IndexWriter deleting a segment
> > > > > > file from underneath the searcher when a segment merge occurs.
> > > > > >
> > > > > > The first place to look would be to use a memory profiler to
> > > > > > determine what is actually consuming the memory.  I use the
> > > > > > SciTech .NET Memory Profiler for such purposes.
> > > > > >
> > > > > > Michael
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jeff Pennal [mailto:[email protected]]
> > > > > > Sent: Tuesday, January 05, 2010 12:42 PM
> > > > > > To: [email protected]
> > > > > > Subject: Possible memory leak in Lucene.NET 2.4?
> > > > > >
> > > > > > Hello all,
> > > > > >
> > > > > > In doing some profiling of our Lucene code, I noticed that we
> > > > > > were running an optimize after every update to our index.
> > > > > > Though our index is relatively small (~75MB), the optimize task
> > > > > > still took way too much time to run.
> > > > > >
> > > > > > I did some research and it seems like it would not be an issue
> > > > > > to update our index without optimizing afterwards, the side
> > > > > > effect being that we'd have more open file handles.
> > > > > >
> > > > > > I made that change and noticed some horrible performance side
> > > > > > effects.
> > > > > >
> > > > > > The first thing I noticed was that the CPU for our web
> > > > > > application (ASP.NET MVC) that read from the index never went
> > > > > > below 60-70% and was frequently pegged at 99%.
> > > > > >
> > > > > > In addition to the CPU spiking, the memory taken up by the
> > > > > > w3wp.exe process quickly grew to around 800MB, which is about
> > > > > > 300MB above normal.
> > > > > >
> > > > > > This has all the hallmarks of a memory leak somewhere.
> > > > > >
> > > > > > Finally, I noticed that the IndexReader was locking some of the
> > > > > > files in the index folder even though the reader was set to
> > > > > > nolock mode. This seemed to be the cause of the increase in the
> > > > > > number of files in the index folder.
> > > > > >
> > > > > > We have the IndexReader set to open once and then be shared
> > > > > > among every request to the web application. My understanding is
> > > > > > that this is the correct way to do this, and this never caused
> > > > > > any issues when we were optimizing the index after every update.
> > > > > >
> > > > > > I know this is a pretty vague problem and there could be any
> > > > > > number of issues involved here. However, if anyone could suggest
> > > > > > areas to look into for possible solutions, it would be greatly
> > > > > > appreciated.
> > > > > >
> > > > > > Thanks,
> > > > > > Jeff
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > Simone Chiaretta
> > > > > Microsoft MVP ASP.NET - ASPInsider
> > > > > Blog: http://codeclimber.net.nz
> > > > > RSS: http://feeds2.feedburner.com/codeclimber
> > > > > twitter: @simonech
> > > > >
> > > > > Any sufficiently advanced technology is indistinguishable from
> > magic
> > > > > "Life is short, play hard"
> > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Simone Chiaretta
> > > Microsoft MVP ASP.NET - ASPInsider
> > > Blog: http://codeclimber.net.nz
> > > RSS: http://feeds2.feedburner.com/codeclimber
> > > twitter: @simonech
> > >
> > > Any sufficiently advanced technology is indistinguishable from magic
> > > "Life is short, play hard"
> > >
> > >
> >
> >
> > --
> > Simone Chiaretta
> > Microsoft MVP ASP.NET - ASPInsider
> > Blog: http://codeclimber.net.nz
> > RSS: http://feeds2.feedburner.com/codeclimber
> > twitter: @simonech
> >
> > Any sufficiently advanced technology is indistinguishable from magic
> > "Life is short, play hard"
> >
> >
>
>
> --
> Simone Chiaretta
> Microsoft MVP ASP.NET - ASPInsider
> Blog: http://codeclimber.net.nz
> RSS: http://feeds2.feedburner.com/codeclimber
> twitter: @simonech
>
> Any sufficiently advanced technology is indistinguishable from magic
> "Life is short, play hard"
>
>


-- 
Simone Chiaretta
Microsoft MVP ASP.NET - ASPInsider
Blog: http://codeclimber.net.nz
RSS: http://feeds2.feedburner.com/codeclimber
twitter: @simonech

Any sufficiently advanced technology is indistinguishable from magic
"Life is short, play hard"
