Re: Possible memory leak in Lucene.NET 2.4?

Simone Chiaretta Thu, 07 Jan 2010 09:01:25 -0800

OK, thank you
Simone

On Thu, Jan 7, 2010 at 5:58 PM, Digy <[email protected]> wrote:


> - Those filters are under the contrib section of Lucene Java and they are not
> ported to .NET. So, if you want to use BooleanFilter you have to port it
> yourself.
> - "TermRangeFilter" is available in Java too. It is not mentioned on that
> page since it is not a direct subclass of "Filter" (instead it is a subclass of
> MultiTermQueryWrapperFilter).
>
> DIGY
>
> -----Original Message-----
> From: Simone Chiaretta [mailto:[email protected]]
> Sent: Thursday, January 07, 2010 12:39 PM
> To: lucene-net-user
> Subject: Re: Possible memory leak in Lucene.NET 2.4?
>
> I'm having a look at the filters.
> The Java version of Lucene has quite a few filters implemented:
>
> http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/search/Filter.html
>
> like the BooleanFilter (which I wanted to use in my search),
>
> but looking at the same version in .NET I don't see many of them, and I see
> others that are not available in the Java version (like the TermRangeFilter).
>
> Why is that?
>
> Simone
>
>
> On Wed, Jan 6, 2010 at 8:57 PM, Michael Garski
> <[email protected]>wrote:
>
> > Simone,
> >
> > There is a trade-off in the use of filters - more memory consumed but
> > faster performance.  It's a good idea to test both approaches (filter
> > vs. Boolean clause) to find what works best for you.
> >
> > Each filter will consume 1 bit in memory for each document plus the
> > overhead of the object itself.  With 50K documents, each filter would
> > consume approximately 6.5KB of memory, with 1500 consuming approximately
> > 10MB.  That's not a whole lot of memory, and will give you a bump in
> > search performance.  How much of a bump, you'll have to test to
> > determine.  I don't have much experience with indexes of your size,
> > however for the large indexes we work with the gain can be significant.
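
As a rough check of the numbers above, here is a minimal sketch of the arithmetic in Python. It counts only the one bit per document; the per-object overhead mentioned above is not modeled, and the function name `filter_bitset_bytes` is made up for this illustration.

```python
import math

def filter_bitset_bytes(num_docs: int) -> int:
    """Bytes needed to hold one bit per document (object overhead not included)."""
    return math.ceil(num_docs / 8)

per_filter = filter_bitset_bytes(50_000)   # one filter over 50K documents
all_filters = 1500 * per_filter            # one filter per blog, 1500 blogs

print(f"one filter:   {per_filter} bytes (~{per_filter / 1024:.1f} KiB)")
print(f"1500 filters: {all_filters} bytes (~{all_filters / 1024**2:.1f} MiB)")
```

The raw bits come to roughly 6 KB per filter and about 9 MB for 1500 filters, which is consistent with the approximate figures quoted above once some object overhead is added.
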
> >
> > Michael
> >
> > -----Original Message-----
> > From: Simone Chiaretta [mailto:[email protected]]
> > Sent: Wednesday, January 06, 2010 11:46 AM
> > To: [email protected]
> > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> >
> > Michael,
> > great suggestions, thank you.
> >
> > More below
> >
> > On Wed, Jan 6, 2010 at 8:29 PM, Michael Garski
> > <[email protected]>wrote:
> >
> > > Simone -
> > >
> > > Filters will provide for more efficient queries in your case if you
> > > filter on the blog id rather than using it as a query clause, as the
> > > filter can be cached and re-used for future queries.  Be sure to use the
> > > FilterManager to ensure your filters are being cached and not re-created
> > > for each query.
> > >
> >
> > It makes sense when I have just one blog... but can I have, let's say,
> > 1500 different filters, one per blog?
> > What I need is to filter blog posts based on the blog I'm currently in:
> > so I have to search over blog 1 if I'm in blog 1, and in blog 1500 if I'm
> > in blog 1500.
> > Does this mean I'll have to cache 1500 different filters? Or in this
> > case will a simple plain query be better?
> >
> >
> >
> > > Optimizing on startup could delay the app pool being available.  I'd
> > > suggest that rather than optimizing on a periodic basis you create
> > > a custom MergePolicy to control the number of segments in the index and
> > > when segments are merged.  With 2.9 I take this approach and don't use
> > > the Optimize call at all anymore.  Provided you don't have thousands of
> > > segments, a multi-segment index should not pose a performance issue.
> > > There are quite a number of other performance improvements that can be
> > > made that have a bigger impact, such as filters.
> > >
> >
> > I'll take a look at that.
> >
> >
> >
> > >
> > > The 2.9 version I am using is the current trunk version.  It's very
> > > stable and I have not encountered any issues.
> > >
> >
> > Ok.. great...thx
> >
> >
> > >
> > > Retrieving the IndexReader from the IndexWriter will give you the most
> > > recent changes, including those that have not yet been committed.
> > >
> > > > http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/index/IndexWriter.html#getReader%28%29
> > >
> > >
> > OK... 2.9 seems to be solving most of my problems :)
> >
> >
> > > Unfortunately I don't have the bandwidth at this time to help you out
> > > with a code review, but should you have any questions, continue to post
> > > them to the list and I'll provide some feedback as time allows.
> > >
> >
> > Ok.. no problem... thank you anyway.. you are being very helpful
> > Simo
> >
> >
> > >
> > > Michael
> > >
> > > -----Original Message-----
> > > From: Simone Chiaretta [mailto:[email protected]]
> > > Sent: Wednesday, January 06, 2010 11:11 AM
> > > To: [email protected]
> > > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> > >
> > > Hi Michael,
> > >
> > > more below
> > >
> > > On Wed, Jan 6, 2010 at 7:41 PM, Michael Garski
> > > <[email protected]>wrote:
> > >
> > > > Simone,
> > > >
> > > > Filters work to constrain the query to the subset of documents that are
> > > > contained in the filter, which can improve performance.
> > >
> > >
> > > Ok, from what I see, filtering can help me filter out posts from other
> > > blogs.
> > > But can filters change with every query?
> > > What's the difference between:
> > > query for "xyz" on blog 1 over all index
> > > vs.
> > > query for "xyz" over the index filtered by blog 1?
> > >
> > >
> > > > The field cache
> > > > is used to cache values if you are sorting by something other than the
> > > > score, such as by date or some other value in the index.
> > > >
> > >
> > >
> > > I'm just sorting by score... so probably not needed
> > >
> > >
> > > >
> > > > Optimizing after each document incurs an unnecessary overhead as all
> > > > segments are merged into one, which is not necessary even in versions
> > > > prior to 2.9.
> > > >
> > >
> > > Great, thank you... I can remove this, which would help speed up the add
> > > document procedure on large indexes...
> > > And since in a web app the pool recycles anyway every day or so, doing an
> > > optimize at the creation of the index writer will be enough, correct?
> > >
> > >
> > > > If your app has not yet been released, I would suggest using 2.9 and
> > > > ensuring you are not using any methods or properties marked with the
> > > > Obsolete attribute to streamline migration to future versions.
> > >
> > >
> > > Great... thank you again... 2.9 is the trunk, right? I don't see a tag
> > > for it in SVN.
> > >
> > >
> > > > Another
> > > > change in 2.9 you could take advantage of is retrieving an IndexReader
> > > > from the IndexWriter through the GetReader method, which will save you
> > > > from having to have both a writer and a reader in application scope.
> > > > The writer could be held at the application level and the reader
> > > > retrieved from it directly.
> > > >
> > >
> > > And that will give the most current reader updated with the latest new
> > > docs?
> > >
> > > One last thing:
> > > Would you be so kind (if you have time, and with the proper credit given
> > > in the source code and in the release notes) as to do a kind of source code
> > > review of the search engine of the blog?
> > > Thx
> > >
> > > Simone
> > >
> > >
> > > >
> > > > Michael
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Simone Chiaretta [mailto:[email protected]]
> > > > Sent: Wednesday, January 06, 2010 10:28 AM
> > > > To: [email protected]
> > > > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> > > >
> > > > I'm just using queries... I'm pretty new to Lucene, so I went for the
> > > > easier solution.
> > > > Would you recommend using filters and caching instead of queries?
> > > >
> > > > At the moment I'm on Lucene 2.3.1... would you recommend moving to 2.9?
> > > > My app has not been released yet (an open source blogging engine), but
> > > > will be shortly.
> > > > The number of documents indexed will range from 0 to 50,000 blog posts
> > > > (our biggest installation atm).
> > > >
> > > > Will not optimizing after every new document reduce the performance of
> > > > the searches on such indexes?
> > > >
> > > > Simone
> > > >
> > > > On Wed, Jan 6, 2010 at 7:08 PM, Michael Garski
> > > > <[email protected]>wrote:
> > > >
> > > > > Simone,
> > > > >
> > > > > Are you using any field caches or filters?
> > > > >
> > > > > In versions prior to 2.9, reopening the index will completely rebuild
> > > > > the field cache and filter bits for all documents in the index, which
> > > > > can result in an increase in memory consumption.  In 2.9 and future
> > > > > versions, the field cache and filter bits are cached at a segment level,
> > > > > which results in significantly faster re-opens as only the new segments
> > > > > are loaded into the caches.
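
The difference can be sketched with a toy model in Python. This is not Lucene code: the segment names and the `load_segment_cache` function are hypothetical, and a "cache" here is just a placeholder string per segment.

```python
loads = []  # records which segments had their cache (re)built

def load_segment_cache(segment: str) -> str:
    """Stand-in for the expensive work of building field cache / filter bits."""
    loads.append(segment)
    return f"cache({segment})"

def reopen_whole_index(segments: list) -> dict:
    """Pre-2.9 behavior in this model: every re-open rebuilds caches for all segments."""
    return {seg: load_segment_cache(seg) for seg in segments}

def reopen_per_segment(segments: list, old: dict) -> dict:
    """2.9 behavior in this model: keep caches of known segments, load only new ones."""
    return {seg: old[seg] if seg in old else load_segment_cache(seg)
            for seg in segments}

caches = reopen_per_segment(["_0", "_1"], {})            # first open loads both
loads.clear()
caches = reopen_per_segment(["_0", "_1", "_2"], caches)  # one segment was added
print(loads)  # segments loaded by the per-segment re-open

loads.clear()
reopen_whole_index(["_0", "_1", "_2"])
print(loads)  # segments loaded by the whole-index re-open
```

In this model, merging everything into a single segment after each update would make that segment new on every re-open, so the per-segment approach would have nothing to re-use.
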
> > > > >
> > > > > Our applications use very large indexes and 2.9's segment level caching
> > > > > allows us to re-open indexes much faster while utilizing less memory in
> > > > > the process.
> > > > >
> > > > > Michael
> > > > >
> > > > > -----Original Message-----
> > > > > From: Simone Chiaretta [mailto:[email protected]]
> > > > > Sent: Wednesday, January 06, 2010 10:01 AM
> > > > > To: [email protected]
> > > > > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> > > > >
> > > > > What I am doing is initializing the writer in the App_Start event of the
> > > > > web app, and closing everything at the App_End event.
> > > > > For the reader, I start it at the first search request, re-open it
> > > > > every time a new document is added, and then close it in App_End.
> > > > >
> > > > > If you are interested here is the search engine service I'm using:
> > > > >
> > > > > http://code.google.com/p/subtext/source/browse/trunk/src/Subtext.Framework/Services/SearchEngine/SearchEngineService.cs
> > > > >
> > > > > Simone
> > > > >
> > > > > On Wed, Jan 6, 2010 at 6:31 PM, Matt Honeycutt
> > > > > <[email protected]>wrote:
> > > > >
> > > > > > Won't the various global application events be fired if the app pool is
> > > > > > gracefully terminated/recycled?  While not ideal, couldn't you initialize
> > > > > > your Lucene objects during one of the application initialization events, then
> > > > > > dispose of them in the corresponding shutdown events?
> > > > > >
> > > > > > On Wed, Jan 6, 2010 at 11:14 AM, Michael Garski
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > If it's not an option to create search functionality in a separate process,
> > > > > > > such as in a shared hosting environment, you may be limited in the size of
> > > > > > > your index and how you query it.  The field cache, and to a lesser extent
> > > > > > > filters, will consume a fair amount of memory that is proportional to the
> > > > > > > number of documents in the index.
> > > > > > >
> > > > > > > As others have mentioned, you will have to ensure that resources are
> > > > > > > released when the app pool recycles.
> > > > > > >
> > > > > > > Michael
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Simone Chiaretta [mailto:[email protected]]
> > > > > > > Sent: Wednesday, January 06, 2010 12:45 AM
> > > > > > > To: [email protected]
> > > > > > > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> > > > > > >
> > > > > > > Unfortunately not everybody can use another process: I'm building a
> > > > > > > blog engine that must be able to run on a shared hosting provider. The
> > > > > > > 2nd process is not an option :)
> > > > > > >
> > > > > > > Simone
> > > > > > >
> > > > > > > On Tuesday, January 5, 2010, Digy <[email protected]> wrote:
> > > > > > > > As Michael stated, I also prefer not hosting "indexing and searching
> > > > > > > > services" in IIS.
> > > > > > > > There are many alternatives such as WCF, Remoting etc. With a separate
> > > > > > > > service for Lucene, you can control anything you want.
> > > > > > > >
> > > > > > > > DIGY
> > > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Michael Garski [mailto:[email protected]]
> > > > > > > > Sent: Tuesday, January 05, 2010 11:11 PM
> > > > > > > > To: [email protected]
> > > > > > > > Subject: RE: Possible memory leak in Lucene.NET 2.4?
> > > > > > > >
> > > > > > > > Jeff,
> > > > > > > >
> > > > > > > > Correct - there is no need to optimize the index after adding a
> > > > > > > > document, and I would recommend against it especially when you move to
> > > > > > > > 2.9 as you will not see any of the benefits of the changes to composite
> > > > > > > > readers such as faster incremental warm-ups to filters and field caches.
> > > > > > > >
> > > > > > > > I've never run Lucene.Net in the context of a web process and would
> > > > > > > > actually recommend against that approach due to app pool recycling,
> > > > > > > > opting for a service that exposes search functionality via WCF.
> > > > > > > >
> > > > > > > > What types of queries are you executing? Are you using filters or
> > > > > > > > sorting?  How often do you re-open the IndexReader that is used for
> > > > > > > > searching?  Re-opening the reader after each document addition can be an
> > > > > > > > expensive process, especially if you are using filters and/or sorts.
> > > > > > > > How are you refreshing the IndexReader?
> > > > > > > >
> > > > > > > > Regarding the IndexReader locking files, this is a feature which allows
> > > > > > > > you to concurrently index and search on the same index and not have to
> > > > > > > > worry about the IndexWriter deleting a segment file from underneath the
> > > > > > > > searcher when a segment merge occurs.
> > > > > > > >
> > > > > > > > The first place to look would be to use a memory profiler to determine
> > > > > > > > what is actually consuming the memory.  I use the SciTech .NET Memory
> > > > > > > > Profiler for such purposes.
> > > > > > > >
> > > > > > > > Michael
> > > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Jeff Pennal [mailto:[email protected]]
> > > > > > > > Sent: Tuesday, January 05, 2010 12:42 PM
> > > > > > > > To: [email protected]
> > > > > > > > Subject: Possible memory leak in Lucene.NET 2.4?
> > > > > > > >
> > > > > > > > Hello all,
> > > > > > > >
> > > > > > > > In doing some profiling of our Lucene code, I noticed that we were doing
> > > > > > > > an optimize after every update to our index. Though our index is
> > > > > > > > relatively small (~75MB), the optimize task still took way too much time
> > > > > > > > to run.
> > > > > > > >
> > > > > > > > I did some research and it seems like it would not be an issue to update
> > > > > > > > our index without optimizing afterwards, the side effect being that we'd
> > > > > > > > have more open file handles.
> > > > > > > >
> > > > > > > > I made that change and noticed some horrible performance side effects.
> > > > > > > >
> > > > > > > > The first thing I noticed was that the CPU for our web application
> > > > > > > > (ASP.NET MVC) that read from the index never went below 60-70% and was
> > > > > > > > frequently pegged at 99%.
> > > > > > > >
> > > > > > > > In addition to the CPU spiking, the memory taken up by the w3wp.exe
> > > > > > > > process quickly grew to around 800MB, which is about 300MB above normal.
> > > > > > > >
> > > > > > > > This has all the hallmarks of a memory leak somewhere.
> > > > > > > >
> > > > > > > > Finally, I noticed that the IndexReader was locking some of the files in
> > > > > > > > the index folder even though the reader was set to nolock mode. This
> > > > > > > > seemed to be the cause of the increase in the number of files in the index
> > > > > > > > folder.
> > > > > > > >
> > > > > > > > We have the IndexReader set to open once and then be shared among every
> > > > > > > > request to the web application. My understanding is that this is the
> > > > > > > > correct way to do this, and this never caused any issues when we were
> > > > > > > > optimizing the index after every update.
> > > > > > > >
> > > > > > > > I know this is a pretty vague problem and there could be any number of
> > > > > > > > issues involved here. However, if anyone could suggest areas to look
> > > > > > > > into for possible solutions, it would be greatly appreciated.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jeff
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Simone Chiaretta
> > > > > > > Microsoft MVP ASP.NET - ASPInsider
> > > > > > > Blog: http://codeclimber.net.nz
> > > > > > > RSS: http://feeds2.feedburner.com/codeclimber
> > > > > > > twitter: @simonech
> > > > > > >
> > > > > > > Any sufficiently advanced technology is indistinguishable from magic
> > > > > > > "Life is short, play hard"
> > > > > > >


-- 
Simone Chiaretta
Microsoft MVP ASP.NET - ASPInsider
Blog: http://codeclimber.net.nz
RSS: http://feeds2.feedburner.com/codeclimber
twitter: @simonech

Any sufficiently advanced technology is indistinguishable from magic
"Life is short, play hard"
