Re: Possible memory leak in Lucene.NET 2.4?

Simone Chiaretta Wed, 06 Jan 2010 12:08:41 -0800

OK... thank you
I'll try
Simo

On Wed, Jan 6, 2010 at 8:57 PM, Michael Garski <[email protected]>wrote:


> Simone,
>
> There is a trade-off in the use of filters - more memory consumed but
> faster performance.  It's a good idea to test both approaches (filter
> vs. Boolean clause) to find what works best for you.
>
> Each filter will consume 1 bit in memory for each document plus the
> overhead of the object itself.  With 50K documents, each filter would
> consume approximately 6.5KB of memory, with 1500 consuming approximately
> 10MB.  That's not a whole lot of memory, and will give you a bump in
> search performance.  How much of a bump, you'll have to test to
> determine.  I don't have much experience with indexes of your size;
> however, for the large indexes we work with the gain can be significant.
>
> Michael
>
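
The figures quoted above can be checked with a little arithmetic. The sketch below (Python, purely illustrative) assumes each cached filter is a plain bit set holding one bit per document and ignores per-object overhead; the function names are made up for this example and are not part of Lucene.NET.

```python
# Back-of-the-envelope estimate of cached filter memory.
# Assumption: each cached filter is a plain bit set with one bit per
# document; per-object overhead is ignored, so real usage is somewhat higher.


def filter_bytes(num_docs: int) -> int:
    """Bytes needed to hold one bit per document, rounded up."""
    return (num_docs + 7) // 8


def total_filter_bytes(num_docs: int, num_filters: int) -> int:
    """Bytes needed when every filter is cached at the same time."""
    return filter_bytes(num_docs) * num_filters


docs = 50_000   # largest installation mentioned in the thread
blogs = 1_500   # one cached filter per blog

per_filter = filter_bytes(docs)
total = total_filter_bytes(docs, blogs)
print(f"per filter:  {per_filter} bytes (~{per_filter / 1024:.1f} KB)")
print(f"all filters: {total} bytes (~{total / 1024 ** 2:.1f} MB)")
```

The raw bit sets come to 6,250 bytes each (about 6.1 KB) and 9,375,000 bytes for all 1500 (about 8.9 MB), so the quoted estimates of roughly 6.5KB and 10MB are consistent once object overhead is added on top.
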
> -----Original Message-----
> From: Simone Chiaretta [mailto:[email protected]]
> Sent: Wednesday, January 06, 2010 11:46 AM
> To: [email protected]
> Subject: Re: Possible memory leak in Lucene.NET 2.4?
>
> Michael,
> great suggestions, thank you
>
> More below
>
> On Wed, Jan 6, 2010 at 8:29 PM, Michael Garski
> <[email protected]>wrote:
>
> > Simone -
> >
> > Filters will provide for more efficient queries in your case if you
> > filter on the blog id rather than using it as a query clause, as the
> > filter can be cached and re-used for future queries.  Be sure to use
> > the FilterManager to ensure your filters are being cached and not
> > re-created for each query.
> >
>
> It makes sense when I have just one blog... but can I have, let's say,
> 1500 different filters, one per blog?
> What I need is to filter blog posts based on the blog I'm currently in:
> so I have to search over blog 1 if I'm in blog 1, and over blog 1500 if
> I'm in blog 1500.
> Does this mean I'll have to cache 1500 different filters? Or in this
> case would a simple plain query be better?
>
>
>
> > Optimizing on startup could delay the app pool being available.  I'd
> > suggest that rather than optimizing on a periodic basis you create
> > a custom MergePolicy to control the number of segments in the index
> > and when segments are merged.  With 2.9 I take this approach and don't
> > use the Optimize call at all anymore.  Provided you don't have
> > thousands of segments, a multi-segment index should not pose a
> > performance issue.  There are quite a number of other performance
> > improvements that can be made that have a bigger impact, such as
> > filters.
> >
>
> I'll take a look at that.
>
>
>
> >
> > The 2.9 version I am using is the current trunk version.  It's very
> > stable and I have not encountered any issues.
> >
>
> Ok.. great...thx
>
>
> >
> > Retrieving the IndexReader from the IndexWriter will give you the most
> > recent changes, including those that have not yet been committed.
> >
> > http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/index/IndexWriter.html#getReader%28%29
> >
> >
> OK... 2.9 seems to be solving most of my problems :)
>
>
> > Unfortunately I don't have the bandwidth at this time to help you out
> > with a code review, but should you have any questions, continue to
> > post them to the list and I'll provide some feedback as time allows.
> >
>
> Ok.. no problem... thank you anyway.. you are being very helpful
> Simo
>
>
> >
> > Michael
> >
> > -----Original Message-----
> > From: Simone Chiaretta [mailto:[email protected]]
> > Sent: Wednesday, January 06, 2010 11:11 AM
> > To: [email protected]
> > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> >
> > Hi Michael,
> >
> > more below
> >
> > On Wed, Jan 6, 2010 at 7:41 PM, Michael Garski
> > <[email protected]>wrote:
> >
> > > Simone,
> > >
> > > Filters work to constrain the query to the subset of documents that
> > > are contained in the filter, which can improve performance.
> >
> >
> > Ok, from what I see, filtering can help me filter out posts from other
> > blogs.
> > But can filters change with every query?
> > What's the difference between:
> > query for "xyz" on blog 1 over all index
> > vs.
> > query for "xyz" over the index filtered by blog 1?
> >
> >
> > > The field cache
> > > is used to cache values if you are sorting by something other than
> > > the score, such as by date or some other value in the index.
> > >
> >
> >
> > I'm just sorting by score... so probably not needed
> >
> >
> > >
> > > Optimizing after each document incurs an unnecessary overhead as all
> > > segments are merged into one, which is not necessary even in
> > > versions prior to 2.9.
> > >
> >
> > Great, thank you... I can remove this; it would help speed up the add
> > document procedure on large indexes...
> > and since in a web app the pool recycles anyway every day or so, doing
> > an optimize at the creation of the index writer will be enough, correct?
> >
> >
> > > If your app has not yet been released, I would suggest using 2.9 and
> > > ensuring you are not using any methods or properties marked with the
> > > Obsolete attribute to streamline migration to future versions.
> >
> >
> > Great... thank you again... 2.9 is the trunk, right? I don't see a tag
> > for it in SVN
> >
> >
> > > Another
> > > change in 2.9 you could take advantage of is retrieving an
> > > IndexReader from the IndexWriter through the GetReader method, which
> > > will save you from having to have both a writer and a reader in
> > > application scope.
> > > The writer could be held at the application level and the reader
> > > retrieved from it directly.
> > >
> >
> > And that will give the most current reader updated with the latest new
> > docs?
> >
> > One last thing:
> > Would you be so kind (if you have time, and with the proper credit
> > given in the source code and in the release notes) as to do a kind of
> > source code review of the search engine of the blog?
> > Thx
> >
> > Simone
> >
> >
> > >
> > > Michael
> > >
> > >
> > > -----Original Message-----
> > > From: Simone Chiaretta [mailto:[email protected]]
> > > Sent: Wednesday, January 06, 2010 10:28 AM
> > > To: [email protected]
> > > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> > >
> > > I'm just using queries... I'm pretty new to Lucene, so I went for
> > > the easier solution.
> > > Would you recommend using filters and caching instead of queries?
> > >
> > > At the moment I'm on Lucene 2.3.1... would you recommend moving to
> > > 2.9?
> > > My app has not been released yet (an open source blogging engine),
> > > but will be shortly.
> > > The number of documents indexed will range from 0 to 50,000 blog
> > > posts (our biggest installation atm).
> > >
> > > Will not optimizing after every new document reduce the performance
> > > of searches on such indexes?
> > >
> > > Simone
> > >
> > > On Wed, Jan 6, 2010 at 7:08 PM, Michael Garski
> > > <[email protected]>wrote:
> > >
> > > > Simone,
> > > >
> > > > Are you using any field caches or filters?
> > > >
> > > > In versions prior to 2.9, reopening the index will completely
> > > > rebuild the field cache and filter bits for all documents in the
> > > > index, which can result in an increase in memory consumption.  In
> > > > 2.9 and future versions, the field cache and filter bits are cached
> > > > at a segment level, which results in significantly faster re-opens
> > > > as only the new segments are loaded into the caches.
> > > >
> > > > Our applications use very large indexes and 2.9's segment level
> > > > caching allows us to re-open indexes much faster while utilizing
> > > > less memory in the process.
> > > >
> > > > Michael
> > > >
> > > > -----Original Message-----
> > > > From: Simone Chiaretta [mailto:[email protected]]
> > > > Sent: Wednesday, January 06, 2010 10:01 AM
> > > > To: [email protected]
> > > > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> > > >
> > > > What I am doing is initializing the writer in the App_Start event
> > > > of the web app, and closing everything at the App_End event.
> > > > For the reader, I start it at the first search request, re-open it
> > > > every time a new document is added, and then close it in the App_End
> > > > event.
> > > >
> > > > If you are interested here is the search engine service I'm using:
> > > > http://code.google.com/p/subtext/source/browse/trunk/src/Subtext.Framework/Services/SearchEngine/SearchEngineService.cs
> > > >
> > > > Simone
> > > >
> > > > On Wed, Jan 6, 2010 at 6:31 PM, Matt Honeycutt
> > > > <[email protected]>wrote:
> > > >
> > > > > Won't the various global application events be fired if the app
> > > > > pool is gracefully terminated/recycled?  While not ideal, couldn't
> > > > > you initialize your Lucene objects during one of the application
> > > > > initialization events, then dispose of them in the corresponding
> > > > > shutdown events?
> > > > >
> > > > > On Wed, Jan 6, 2010 at 11:14 AM, Michael Garski
> > > > <[email protected]
> > > > > >wrote:
> > > > >
> > > > > > If it's not an option to create search functionality in a
> > > > > > separate process, such as in a shared hosting environment, you
> > > > > > may be limited in the size of your index and how you query it.
> > > > > > The field cache, and to a lesser extent filters, will consume a
> > > > > > fair amount of memory that is proportional to the number of
> > > > > > documents in the index.
> > > > > >
> > > > > > As others have mentioned, you will have to ensure that resources
> > > > > > are released when the app pool recycles.
> > > > > >
> > > > > > Michael
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Simone Chiaretta [mailto:[email protected]]
> > > > > > Sent: Wednesday, January 06, 2010 12:45 AM
> > > > > > To: [email protected]
> > > > > > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> > > > > >
> > > > > > Unfortunately not everybody can use another process: I'm
> > > > > > building a blog engine that must be able to run on a shared
> > > > > > hosting provider. The 2nd process is not an option :)
> > > > > >
> > > > > > Simone
> > > > > >
> > > > > > On Tuesday, January 5, 2010, Digy <[email protected]> wrote:
> > > > > > > As Michael stated, I also prefer not hosting "indexing and
> > > > > > > searching services" in IIS.
> > > > > > > There are many alternatives such as WCF, Remoting etc. With a
> > > > > > > separate service for Lucene, you can control anything you want.
> > > > > > >
> > > > > > > DIGY
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Michael Garski [mailto:[email protected]]
> > > > > > > Sent: Tuesday, January 05, 2010 11:11 PM
> > > > > > > To: [email protected]
> > > > > > > Subject: RE: Possible memory leak in Lucene.NET 2.4?
> > > > > > >
> > > > > > > Jeff,
> > > > > > >
> > > > > > > Correct - there is no need to optimize the index after adding
> > > > > > > a document, and I would recommend against it especially when
> > > > > > > you move to 2.9 as you will not see any of the benefits of the
> > > > > > > changes to composite readers such as faster incremental
> > > > > > > warm-ups to filters and field caches.
> > > > > > >
> > > > > > > I've never run Lucene.Net in the context of a web process and
> > > > > > > would actually recommend against that approach due to app pool
> > > > > > > recycling, opting for a service that exposed search
> > > > > > > functionality via WCF.
> > > > > > >
> > > > > > > What types of queries are you executing? Are you using filters
> > > > > > > or sorting?  How often do you re-open the IndexReader that is
> > > > > > > used for searching?  Re-opening the reader after each document
> > > > > > > addition can be an expensive process, especially if you are
> > > > > > > using filters and/or sorts.
> > > > > > > How are you refreshing the IndexReader?
> > > > > > >
> > > > > > > Regarding the IndexReader locking files, this is a feature
> > > > > > > which allows you to concurrently index and search on the same
> > > > > > > index and not have to worry about the IndexWriter deleting a
> > > > > > > segment file from underneath the searcher when a segment merge
> > > > > > > occurs.
> > > > > > >
> > > > > > > The first place to look would be to use a memory profiler to
> > > > > > > determine what is actually consuming the memory.  I use the
> > > > > > > SciTech .NET Memory Profiler for such purposes.
> > > > > > >
> > > > > > > Michael
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Jeff Pennal [mailto:[email protected]]
> > > > > > > Sent: Tuesday, January 05, 2010 12:42 PM
> > > > > > > To: [email protected]
> > > > > > > Subject: Possible memory leak in Lucene.NET 2.4?
> > > > > > >
> > > > > > > Hello all,
> > > > > > >
> > > > > > > In doing some profiling of our Lucene code, I noticed that we
> > > > > > > were doing an optimize call after every update to our index.
> > > > > > > Though our index is relatively small (~75MB), the optimize task
> > > > > > > still took way too much time to run.
> > > > > > >
> > > > > > > I did some research and it seems like it would not be an issue
> > > > > > > to update our index without optimizing afterwards, the side
> > > > > > > effect being that we'd have more open file handles.
> > > > > > >
> > > > > > > I made that change and noticed some horrible performance side
> > > > > > > effects.
> > > > > > >
> > > > > > > The first thing I noticed was that the CPU for our web
> > > > > > > application (ASP.NET MVC) that read from the index never went
> > > > > > > below 60-70% and was frequently pegged at 99%.
> > > > > > >
> > > > > > > In addition to the CPU spiking, the memory taken up by the
> > > > > > > w3wp.exe process quickly grew to around 800MB, which is about
> > > > > > > 300MB above normal.
> > > > > > >
> > > > > > > This has all the hallmarks of a memory leak somewhere.
> > > > > > >
> > > > > > > Finally, I noticed that the IndexReader was locking some of the
> > > > > > > files in the index folder even though the reader was set to
> > > > > > > nolock mode. This seemed to be the cause of the increase in the
> > > > > > > number of files in the index folder.
> > > > > > >
> > > > > > > We have the IndexReader set to open once and then be shared
> > > > > > > among every request to the web application. My understanding is
> > > > > > > that this is the correct way to do this, and this never caused
> > > > > > > any issues when we were optimizing the index after every update.
> > > > > > >
> > > > > > > I know this is a pretty vague problem and there could be any
> > > > > > > number of issues involved here. However, if anyone could suggest
> > > > > > > areas to look into for possible solutions, it would be greatly
> > > > > > > appreciated.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jeff
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Simone Chiaretta
> > > > > > Microsoft MVP ASP.NET - ASPInsider
> > > > > > Blog: http://codeclimber.net.nz
> > > > > > RSS: http://feeds2.feedburner.com/codeclimber
> > > > > > twitter: @simonech
> > > > > >
> > > > > > Any sufficiently advanced technology is indistinguishable from magic
> > > > > > "Life is short, play hard"
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
>
>
>
>


-- 
Simone Chiaretta
Microsoft MVP ASP.NET - ASPInsider
Blog: http://codeclimber.net.nz
RSS: http://feeds2.feedburner.com/codeclimber
twitter: @simonech

Any sufficiently advanced technology is indistinguishable from magic
"Life is short, play hard"
