OK... thank you, I'll try.
Simo

On Wed, Jan 6, 2010 at 8:57 PM, Michael Garski <[email protected]> wrote:
> Simone,
>
> There is a trade-off in the use of filters - more memory consumed but
> faster performance. It's a good idea to test both approaches (filter
> vs. Boolean clause) to find what works best for you.
>
> Each filter will consume 1 bit in memory for each document plus the
> overhead of the object itself. With 50K documents, each filter would
> consume approximately 6.5KB of memory, with 1500 consuming approximately
> 10MB. That's not a whole lot of memory, and will give you a bump in
> search performance. How much of a bump, you'll have to test to
> determine. I don't have much experience with indexes of your size,
> however for the large indexes we work with the gain can be significant.
>
> Michael
>
> -----Original Message-----
> From: Simone Chiaretta [mailto:[email protected]]
> Sent: Wednesday, January 06, 2010 11:46 AM
> To: [email protected]
> Subject: Re: Possible memory leak in Lucene.NET 2.4?
>
> Michael,
> great suggestions, thank you.
>
> More below
>
> On Wed, Jan 6, 2010 at 8:29 PM, Michael Garski <[email protected]> wrote:
>
> > Simone -
> >
> > Filters will provide for more efficient queries in your case if you
> > filter on the blog id rather than using it as a query clause as the
> > filter can be cached and re-used for future queries. Be sure to use the
> > FilterManager to ensure your filters are being cached and not re-created
> > for each query.
>
> It makes sense when I've just one blog... but can I have, let's say, 1500
> different filters, one per blog?
> What I need is filtering blog posts based on the blog I'm currently in:
> so I've to search over blog 1 if I'm in blog 1, and in blog 1500 if I'm in
> blog 1500.
> Will this mean I'll have to cache 1500 different filters? Or in this case
> will a simple plain query be better?
>
> > Optimizing on startup could delay the app pool being available.
> > I'd suggest that rather than optimizing on a periodic basis that you
> > create a custom MergePolicy to control the number of segments in the
> > index and when segments are merged. With 2.9 I take this approach and
> > don't use the Optimize call at all anymore. Provided you don't have
> > thousands of segments, a multi-segment index should not pose a
> > performance issue. There are quite a number of other performance
> > improvements that can be made that have a bigger impact, such as filters.
>
> I'll take a look at that.
>
> > The 2.9 version I am using is the current trunk version. It's very
> > stable and I have not encountered any issues.
>
> Ok.. great...thx
>
> > Retrieving the IndexReader from the IndexWriter will give you the most
> > recent changes, including those that have not yet been committed.
> >
> > http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/index/IndexWriter.html#getReader%28%29
>
> OK... 2.9 seems to be solving most of my problems :)
>
> > Unfortunately I don't have the bandwidth at this time to help you out
> > with a code review, but should you have any questions, continue to post
> > them to the list and I'll provide some feedback as time allows.
>
> Ok.. no problem... thank you anyway.. you are being very helpful
> Simo
>
> > Michael
> >
> > -----Original Message-----
> > From: Simone Chiaretta [mailto:[email protected]]
> > Sent: Wednesday, January 06, 2010 11:11 AM
> > To: [email protected]
> > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> >
> > Hi Michael,
> >
> > more below
> >
> > On Wed, Jan 6, 2010 at 7:41 PM, Michael Garski <[email protected]> wrote:
> >
> > > Simone,
> > >
> > > Filters work to constrain the query to the subset of documents that are
> > > contained in the filter, which can improve performance.
> >
> > Ok, from what I see, filtering can help me filter out posts from other
> > blogs.
> > But can filters change with every query?
> > What's the difference between:
> > query for "xyz" on blog 1 over all index
> > vs.
> > query for "xyz" over the index filtered by blog 1?
> >
> > > The field cache
> > > is used to cache values if you are sorting by something other than the
> > > score, such as by date or some other value in the index.
> >
> > I'm just sorting by score... so probably not needed
> >
> > > Optimizing after each document incurs an unnecessary overhead as all
> > > segments are merged into one, which is not necessary even in versions
> > > prior to 2.9.
> >
> > Great, thank you... I can remove this, it would help speed up the add
> > document procedure on large indexes...
> > and since in a web app the pool recycles anyway every day or so, doing an
> > optimize at the creation of the index writer will be enough, correct?
> >
> > > If your app has not yet been released, I would suggest using 2.9 and
> > > ensuring you are not using any methods or properties marked with the
> > > Obsolete attribute to streamline migration to future versions.
> >
> > Great... thank you again... 2.9 is the trunk, right? I don't see a tag
> > for it in SVN
> >
> > > Another
> > > change in 2.9 you could take advantage of is retrieving an IndexReader
> > > from the IndexWriter through the GetReader method, which will save you
> > > from having to have both a writer and a reader in application scope.
> > > The writer could be held at the application level and the reader
> > > retrieved from it directly.
> >
> > And that will give the most current reader updated with the latest new
> > docs?
> >
> > One last thing:
> > Would you be so kind (if you have time, and with the proper credit given
> > in the source code and in the release notes) to do a kind of source code
> > review of the search engine of the blog?
> > Thx
> >
> > Simone
> >
> > > Michael
> > >
> > > -----Original Message-----
> > > From: Simone Chiaretta [mailto:[email protected]]
> > > Sent: Wednesday, January 06, 2010 10:28 AM
> > > To: [email protected]
> > > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> > >
> > > I'm just using queries... I'm pretty new to Lucene, so I went for the
> > > easier solution.
> > > Would you recommend using filters and caching instead of queries?
> > >
> > > At the moment I'm on Lucene 2.3.1... would you recommend moving to 2.9?
> > > My app has not been released yet (an open source blogging engine), but
> > > will be shortly.
> > > The number of documents indexed will range from 0 to 50,000 blog posts
> > > (our biggest installation atm).
> > >
> > > Will not optimizing after every new document reduce the performance of
> > > searches on such indexes?
> > >
> > > Simone
> > >
> > > On Wed, Jan 6, 2010 at 7:08 PM, Michael Garski <[email protected]> wrote:
> > >
> > > > Simone,
> > > >
> > > > Are you using any field caches or filters?
> > > >
> > > > In versions prior to 2.9, reopening the index will completely rebuild
> > > > the field cache and filter bits for all documents in the index, which
> > > > can result in an increase in memory consumption. In 2.9 and future
> > > > versions, the field cache and filter bits are cached at a segment
> > > > level, which results in significantly faster re-opens as only the new
> > > > segments are loaded into the caches.
> > > >
> > > > Our applications use very large indexes and 2.9's segment level
> > > > caching allows us to re-open indexes much faster while utilizing less
> > > > memory in the process.
> > > >
> > > > Michael
> > > >
> > > > -----Original Message-----
> > > > From: Simone Chiaretta [mailto:[email protected]]
> > > > Sent: Wednesday, January 06, 2010 10:01 AM
> > > > To: [email protected]
> > > > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> > > >
> > > > What I am doing is initializing the writer in the App_Start event of
> > > > the web app, and closing everything at the App_End event.
> > > > For the reader, I start it at the first search request, re-open it
> > > > every time a new document is added, and then close it in the App_End
> > > > event.
> > > >
> > > > If you are interested here is the search engine service I'm using:
> > > >
> > > > http://code.google.com/p/subtext/source/browse/trunk/src/Subtext.Framework/Services/SearchEngine/SearchEngineService.cs
> > > >
> > > > Simone
> > > >
> > > > On Wed, Jan 6, 2010 at 6:31 PM, Matt Honeycutt <[email protected]> wrote:
> > > >
> > > > > Won't the various global application events be fired if the app pool
> > > > > is gracefully terminated/recycled? While not ideal, couldn't you
> > > > > initialize your Lucene objects during one of the application
> > > > > initialization events, then dispose of them in the corresponding
> > > > > shutdown events?
> > > > >
> > > > > On Wed, Jan 6, 2010 at 11:14 AM, Michael Garski <[email protected]> wrote:
> > > > >
> > > > > > If it's not an option to create search functionality in a separate
> > > > > > process, such as in a shared hosting environment, you may be limited
> > > > > > in the size of your index and how you query it.
> > > > > > The field cache, and to a lesser extent filters, will consume a fair
> > > > > > amount of memory that is proportional to the number of documents in
> > > > > > the index.
> > > > > >
> > > > > > As others have mentioned, you will have to ensure that resources are
> > > > > > released when the app pool recycles.
> > > > > >
> > > > > > Michael
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Simone Chiaretta [mailto:[email protected]]
> > > > > > Sent: Wednesday, January 06, 2010 12:45 AM
> > > > > > To: [email protected]
> > > > > > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> > > > > >
> > > > > > Unfortunately not everybody can use another process: I'm building a
> > > > > > blog engine that must be able to run on a shared hosting provider.
> > > > > > The 2nd process is not an option :)
> > > > > >
> > > > > > Simone
> > > > > >
> > > > > > On Tuesday, January 5, 2010, Digy <[email protected]> wrote:
> > > > > > > As Michael stated, I also prefer not hosting "indexing and
> > > > > > > searching services" in IIS.
> > > > > > > There are many alternatives such as WCF, Remoting etc. With a
> > > > > > > separate service for Lucene, you can control anything you want.
> > > > > > >
> > > > > > > DIGY
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Michael Garski [mailto:[email protected]]
> > > > > > > Sent: Tuesday, January 05, 2010 11:11 PM
> > > > > > > To: [email protected]
> > > > > > > Subject: RE: Possible memory leak in Lucene.NET 2.4?
> > > > > > >
> > > > > > > Jeff,
> > > > > > >
> > > > > > > Correct - there is no need to optimize the index after adding a
> > > > > > > document, and I would recommend against it especially when you move
> > > > > > > to 2.9 as you will not see any of the benefits of the changes to
> > > > > > > composite readers such as faster incremental warm-ups to filters and
> > > > > > > field caches.
> > > > > > >
> > > > > > > I've never run Lucene.Net in the context of a web process and would
> > > > > > > actually recommend against that approach due to app pool recycling,
> > > > > > > opting for a service that exposed search functionality via WCF.
> > > > > > >
> > > > > > > What types of queries are you executing? Are you using filters or
> > > > > > > sorting? How often do you re-open the IndexReader that is used for
> > > > > > > searching? Re-opening the reader after each document addition can be
> > > > > > > an expensive process, especially if you are using filters and/or
> > > > > > > sorts. How are you refreshing the IndexReader?
> > > > > > >
> > > > > > > Regarding the IndexReader locking files, this is a feature which
> > > > > > > allows you to concurrently index and search on the same index and not
> > > > > > > have to worry about the IndexWriter deleting a segment file from
> > > > > > > underneath the searcher when a segment merge occurs.
> > > > > > >
> > > > > > > The first place to look would be to use a memory profiler to
> > > > > > > determine what is actually consuming the memory. I use the SciTech
> > > > > > > .NET Memory Profiler for such purposes.
> > > > > > >
> > > > > > > Michael
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Jeff Pennal [mailto:[email protected]]
> > > > > > > Sent: Tuesday, January 05, 2010 12:42 PM
> > > > > > > To: [email protected]
> > > > > > > Subject: Possible memory leak in Lucene.NET 2.4?
> > > > > > >
> > > > > > > Hello all,
> > > > > > >
> > > > > > > In doing some profiling of our Lucene code, I noticed that we were
> > > > > > > doing an optimize after every update to our index. Though our index
> > > > > > > is relatively small (~75MB), the optimize task still took way too
> > > > > > > much time to run.
> > > > > > >
> > > > > > > I did some research and it seems like it would not be an issue to
> > > > > > > update our index without optimizing afterwards, the side effect being
> > > > > > > that we'd have more open file handles.
> > > > > > >
> > > > > > > I made that change and noticed some horrible performance side effects.
> > > > > > >
> > > > > > > The first thing I noticed was that the CPU for our web application
> > > > > > > (ASP.NET MVC) that read from the index never went below 60-70% and
> > > > > > > was frequently pegged at 99%.
> > > > > > >
> > > > > > > In addition to the CPU spiking, the memory taken up by the w3wp.exe
> > > > > > > process quickly grew to around 800MB, which is about 300MB above
> > > > > > > normal.
> > > > > > >
> > > > > > > This has all the hallmarks of a memory leak somewhere.
> > > > > > >
> > > > > > > Finally, I noticed that the IndexReader was locking some of the files
> > > > > > > in the index folder even though the reader was set to nolock mode.
> > > > > > > This seemed to be the cause of the increase in the number of files in
> > > > > > > the index folder.
> > > > > > >
> > > > > > > We have the IndexReader set to open once and then be shared among
> > > > > > > every request to the web application. My understanding is that this
> > > > > > > is the correct way to do this, and this never caused any issues when
> > > > > > > we were optimizing the index after every update.
> > > > > > >
> > > > > > > I know this is a pretty vague problem and there could be any number
> > > > > > > of issues involved here. However, if anyone could suggest areas to
> > > > > > > look into for possible solutions, it would be greatly appreciated.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jeff
> > > > > >
> > > > > > --
> > > > > > Simone Chiaretta
> > > > > > Microsoft MVP ASP.NET - ASPInsider
> > > > > > Blog: http://codeclimber.net.nz
> > > > > > RSS: http://feeds2.feedburner.com/codeclimber
> > > > > > twitter: @simonech
> > > > > >
> > > > > > Any sufficiently advanced technology is indistinguishable from magic
> > > > > > "Life is short, play hard"

--
Simone Chiaretta
Microsoft MVP ASP.NET - ASPInsider
Blog: http://codeclimber.net.nz
RSS: http://feeds2.feedburner.com/codeclimber
twitter: @simonech

Any sufficiently advanced technology is indistinguishable from magic
"Life is short, play hard"
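
The memory estimate quoted in this thread (1 bit per document per filter, 50K documents, 1500 filters) can be checked with a little arithmetic. What follows is only a rough sketch: the class and method names are made up for illustration and are not part of Lucene or Lucene.NET, and the per-object overhead mentioned in the thread is ignored because it depends on the runtime.

```java
public class FilterMemoryEstimate {

    // One bit per document, rounded up to whole bytes.
    // Object overhead is deliberately left out of this estimate.
    static long bytesPerFilter(long docCount) {
        return (docCount + 7) / 8;
    }

    // Total for a number of cached filters over the same index.
    static long totalBytes(long docCount, long filterCount) {
        return bytesPerFilter(docCount) * filterCount;
    }

    public static void main(String[] args) {
        long docs = 50_000;   // largest installation mentioned in the thread
        long filters = 1_500; // one cached filter per blog

        System.out.println("per filter:  " + bytesPerFilter(docs) + " bytes");
        System.out.println("all filters: " + totalBytes(docs, filters) + " bytes");
    }
}
```

Under these assumptions the result is 6,250 bytes per filter and 9,375,000 bytes for all 1500, which is in line with the rough figures of about 6.5KB and 10MB given above once some object overhead is added.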
