OK, thank you

Simone

On Thu, Jan 7, 2010 at 5:58 PM, Digy <[email protected]> wrote:
> - Those filters are under the contrib section of Lucene Java and they are
>   not ported to .NET. So, if you want to use BooleanFilter you have to port
>   it yourself.
> - "TermRangeFilter" is available in Java too. It is not mentioned on that
>   page since it is not a direct subclass of "Filter" (instead it is a
>   subclass of MultiTermQueryWrapperFilter).
>
> DIGY
>
> -----Original Message-----
> From: Simone Chiaretta [mailto:[email protected]]
> Sent: Thursday, January 07, 2010 12:39 PM
> To: lucene-net-user
> Subject: Re: Possible memory leak in Lucene.NET 2.4?
>
> I'm having a look at the filters:
> the Java version of Lucene has quite a few filters implemented:
> http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/search/Filter.html
> like the BooleanFilter (which I wanted to use in my search),
> but looking at the same version in .NET I don't see many of them, and I do
> see others that are not available in the Java version (like the
> TermRangeFilter).
>
> Why is that?
>
> Simone
>
> On Wed, Jan 6, 2010 at 8:57 PM, Michael Garski <[email protected]> wrote:
>
> > Simone,
> >
> > There is a trade-off in the use of filters - more memory consumed but
> > faster performance. It's a good idea to test both approaches (filter
> > vs. Boolean clause) to find what works best for you.
> >
> > Each filter will consume 1 bit in memory for each document plus the
> > overhead of the object itself. With 50K documents, each filter would
> > consume approximately 6.5KB of memory, with 1500 consuming approximately
> > 10MB. That's not a whole lot of memory, and will give you a bump in
> > search performance. How much of a bump, you'll have to test to
> > determine. I don't have much experience with indexes of your size,
> > however for the large indexes we work with the gain can be significant.
> >
> > Michael
> >
> > -----Original Message-----
> > From: Simone Chiaretta [mailto:[email protected]]
> > Sent: Wednesday, January 06, 2010 11:46 AM
> > To: [email protected]
> > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> >
> > Michael,
> > great suggestions, thank you
> >
> > More below
> >
> > On Wed, Jan 6, 2010 at 8:29 PM, Michael Garski <[email protected]> wrote:
> >
> > > Simone -
> > >
> > > Filters will provide for more efficient queries in your case if you
> > > filter on the blog id rather than using it as a query clause as the
> > > filter can be cached and re-used for future queries. Be sure to use the
> > > FilterManager to ensure your filters are being cached and not re-created
> > > for each query.
> >
> > It makes sense when I've just one blog... but can I have, let's say, 1500
> > different filters, one per blog?
> > What I need is filtering blog posts based on the blog I'm currently in:
> > so I have to search over blog 1 if I'm in blog 1, and in blog 1500 if I'm
> > in blog 1500.
> > This will mean I'll have to cache 1500 different filters? Or in this case
> > a simple plain query will be better?
> >
> > > Optimizing on startup could delay the app pool being available. I'd
> > > suggest that rather than optimizing on a periodic basis that you create
> > > a custom MergePolicy to control the number of segments in the index and
> > > when segments are merged. With 2.9 I take this approach and don't use
> > > the Optimize call at all anymore. Provided you don't have thousands of
> > > segments, a multi-segment index should not pose a performance issue.
> > > There are quite a number of other performance improvements that can be
> > > made that have a bigger impact, such as filters.
> >
> > I'll take a look at that.
> >
> > > The 2.9 version I am using is the current trunk version. It's very
> > > stable and I have not encountered any issues.
> >
> > Ok.. great...thx
> >
> > > Retrieving the IndexReader from the IndexWriter will give you the most
> > > recent changes, including those that have not yet been committed.
> > >
> > > http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/index/IndexWriter.html#getReader%28%29
> >
> > OK... 2.9 seems to be solving most of my problems :)
> >
> > > Unfortunately I don't have the bandwidth at this time to help you out
> > > with a code review, but should you have any questions, continue to post
> > > them to the list and I'll provide some feedback as time allows.
> >
> > Ok.. no problem... thank you anyway.. you are being very helpful
> > Simo
> >
> > > Michael
> > >
> > > -----Original Message-----
> > > From: Simone Chiaretta [mailto:[email protected]]
> > > Sent: Wednesday, January 06, 2010 11:11 AM
> > > To: [email protected]
> > > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> > >
> > > Hi Michael,
> > >
> > > more below
> > >
> > > On Wed, Jan 6, 2010 at 7:41 PM, Michael Garski <[email protected]> wrote:
> > >
> > > > Simone,
> > > >
> > > > Filters work to constrain the query to the subset of documents that are
> > > > contained in the filter, which can improve performance.
> > >
> > > Ok, from what I see, filtering can help me filter out posts from other
> > > blogs.
> > > But can filters change with every query?
> > > What's the difference between:
> > > query for "xyz" on blog 1 over all index
> > > vs.
> > > query for "xyz" over the index filtered by blog 1?
> > >
> > > > The field cache
> > > > is used to cache values if you are sorting by something other than the
> > > > score, such as by date or some other value in the index.
> > >
> > > I'm just sorting by score... so probably not needed
> > >
> > > > Optimizing after each document incurs an unnecessary overhead as all
> > > > segments are merged into one, which is not necessary even in versions
> > > > prior to 2.9.
> > >
> > > Great, thank you... I can remove this, which would help speed up the add
> > > document procedure on large indexes...
> > > and since in a web app the pool recycles anyway every day or so, doing an
> > > optimize at the creation of the index writer will be enough, correct?
> > >
> > > > If your app has not yet been released, I would suggest using 2.9 and
> > > > ensuring you are not using any methods or properties marked with the
> > > > Obsolete attribute to streamline migration to future versions.
> > >
> > > Great... thank you again... 2.9 is the trunk, right? I don't see a tag
> > > for it in SVN
> > >
> > > > Another
> > > > change in 2.9 you could take advantage of is retrieving an IndexReader
> > > > from the IndexWriter through the GetReader method, which will save you
> > > > from having to have both a writer and a reader in application scope.
> > > > The writer could be held at the application level and the reader
> > > > retrieved from it directly.
> > >
> > > And that will give the most current reader updated with the latest new
> > > docs?
> > >
> > > One last thing:
> > > Would you be so kind (if you have time, and with the proper credit given
> > > in the source code and in the release notes) to do a kind of source code
> > > review of the search engine of the blog?
> > > Thx
> > >
> > > Simone
> > >
> > > > Michael
> > > >
> > > > -----Original Message-----
> > > > From: Simone Chiaretta [mailto:[email protected]]
> > > > Sent: Wednesday, January 06, 2010 10:28 AM
> > > > To: [email protected]
> > > > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> > > >
> > > > I'm just using queries... I'm pretty new to Lucene, so I went for the
> > > > easier solution.
> > > > Would you recommend using filters and caching instead of queries?
> > > >
> > > > At the moment I'm on Lucene 2.3.1... would you recommend moving to 2.9?
> > > > My app has not been released yet (an open source blogging engine), but
> > > > will be shortly.
> > > > The number of documents indexed will range from 0 to 50,000 blog posts
> > > > (our biggest installation atm).
> > > >
> > > > Will not optimizing after every new document reduce the performance of
> > > > the searches on such indexes?
> > > >
> > > > Simone
> > > >
> > > > On Wed, Jan 6, 2010 at 7:08 PM, Michael Garski <[email protected]> wrote:
> > > >
> > > > > Simone,
> > > > >
> > > > > Are you using any field caches or filters?
> > > > >
> > > > > In versions prior to 2.9, reopening the index will completely rebuild
> > > > > the field cache and filter bits for all documents in the index, which
> > > > > can result in an increase in memory consumption. In 2.9 and future
> > > > > versions, the field cache and filter bits are cached at a segment level,
> > > > > which results in significantly faster re-opens as only the new segments
> > > > > are loaded into the caches.
> > > > >
> > > > > Our applications use very large indexes and 2.9's segment level caching
> > > > > allows us to re-open indexes much faster while utilizing less memory in
> > > > > the process.
> > > > >
> > > > > Michael
> > > > >
> > > > > -----Original Message-----
> > > > > From: Simone Chiaretta [mailto:[email protected]]
> > > > > Sent: Wednesday, January 06, 2010 10:01 AM
> > > > > To: [email protected]
> > > > > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> > > > >
> > > > > What I am doing is initializing the writer in the App_Start event of the
> > > > > web app, and closing everything at the App_End event.
> > > > > For the reader, I start it at the first search request, re-open it every
> > > > > time a new document is added, and then close it in App_End.
> > > > >
> > > > > If you are interested here is the search engine service I'm using:
> > > > >
> > > > > http://code.google.com/p/subtext/source/browse/trunk/src/Subtext.Framework/Services/SearchEngine/SearchEngineService.cs
> > > > >
> > > > > Simone
> > > > >
> > > > > On Wed, Jan 6, 2010 at 6:31 PM, Matt Honeycutt <[email protected]> wrote:
> > > > >
> > > > > > Won't the various global application events be fired if the app pool is
> > > > > > gracefully terminated/recycled? While not ideal, couldn't you initialize
> > > > > > your Lucene objects during one of the application initialization events,
> > > > > > then dispose of them in the corresponding shutdown events?
> > > > > >
> > > > > > On Wed, Jan 6, 2010 at 11:14 AM, Michael Garski <[email protected]> wrote:
> > > > > >
> > > > > > > If it's not an option to create search functionality in a separate
> > > > > > > process, such as in a shared hosting environment, you may be limited in
> > > > > > > the size of your index and how you query it. The field cache, and to a
> > > > > > > lesser extent filters, will consume a fair amount of memory that is
> > > > > > > proportional to the number of documents in the index.
> > > > > > >
> > > > > > > As others have mentioned, you will have to ensure that resources are
> > > > > > > released when the app pool recycles.
> > > > > > >
> > > > > > > Michael
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Simone Chiaretta [mailto:[email protected]]
> > > > > > > Sent: Wednesday, January 06, 2010 12:45 AM
> > > > > > > To: [email protected]
> > > > > > > Subject: Re: Possible memory leak in Lucene.NET 2.4?
> > > > > > >
> > > > > > > Unfortunately not everybody can use another process: I'm building a
> > > > > > > blog engine that must be able to run on a shared hosting provider. The
> > > > > > > 2nd process is not an option :)
> > > > > > >
> > > > > > > Simone
> > > > > > >
> > > > > > > On Tuesday, January 5, 2010, Digy <[email protected]> wrote:
> > > > > > > > As Michael stated, I also prefer not hosting "indexing and searching
> > > > > > > > services" in IIS.
> > > > > > > > There are many alternatives such as WCF, Remoting etc. With a separate
> > > > > > > > service for Lucene, you can control anything you want.
> > > > > > > >
> > > > > > > > DIGY
> > > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Michael Garski [mailto:[email protected]]
> > > > > > > > Sent: Tuesday, January 05, 2010 11:11 PM
> > > > > > > > To: [email protected]
> > > > > > > > Subject: RE: Possible memory leak in Lucene.NET 2.4?
> > > > > > > >
> > > > > > > > Jeff,
> > > > > > > >
> > > > > > > > Correct - there is no need to optimize the index after adding a
> > > > > > > > document, and I would recommend against it especially when you move to
> > > > > > > > 2.9 as you will not see any of the benefits of the changes to composite
> > > > > > > > readers such as faster incremental warm-ups to filters and field caches.
> > > > > > > >
> > > > > > > > I've never run Lucene.Net in the context of a web process and would
> > > > > > > > actually recommend against that approach due to app pool recycling,
> > > > > > > > opting for a service that exposed search functionality via WCF.
> > > > > > > >
> > > > > > > > What types of queries are you executing? Are you using filters or
> > > > > > > > sorting? How often do you re-open the IndexReader that is used for
> > > > > > > > searching? Re-opening the reader after each document addition can be an
> > > > > > > > expensive process, especially if you are using filters and/or sorts.
> > > > > > > > How are you refreshing the IndexReader?
> > > > > > > >
> > > > > > > > Regarding the IndexReader locking files, this is a feature which allows
> > > > > > > > you to concurrently index and search on the same index and not have to
> > > > > > > > worry about the IndexWriter deleting a segment file from underneath the
> > > > > > > > searcher when a segment merge occurs.
> > > > > > > >
> > > > > > > > The first place to look would be to use a memory profiler to determine
> > > > > > > > what is actually consuming the memory. I use the SciTech .NET Memory
> > > > > > > > Profiler for such purposes.
> > > > > > > >
> > > > > > > > Michael
> > > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Jeff Pennal [mailto:[email protected]]
> > > > > > > > Sent: Tuesday, January 05, 2010 12:42 PM
> > > > > > > > To: [email protected]
> > > > > > > > Subject: Possible memory leak in Lucene.NET 2.4?
> > > > > > > >
> > > > > > > > Hello all,
> > > > > > > >
> > > > > > > > In doing some profiling of our Lucene code, I noticed that we were doing
> > > > > > > > an optimize after every update to our index. Though our index is
> > > > > > > > relatively small (~75MB), the optimize task still took way too much time
> > > > > > > > to run.
> > > > > > > >
> > > > > > > > I did some research and it seems like it would not be an issue to update
> > > > > > > > our index without optimizing afterwards, the side effect being that we'd
> > > > > > > > have more open file handles.
> > > > > > > >
> > > > > > > > I made that change and noticed some horrible performance side effects.
> > > > > > > >
> > > > > > > > The first thing I noticed was that the CPU for our web application
> > > > > > > > (ASP.NET MVC) that read from the index never went below 60-70% and was
> > > > > > > > frequently pegged at 99%.
> > > > > > > >
> > > > > > > > In addition to the CPU spiking, the memory taken up by the w3wp.exe
> > > > > > > > process quickly grew to around 800MB, which is about 300MB above normal.
> > > > > > > >
> > > > > > > > This has all the hallmarks of a memory leak somewhere.
> > > > > > > >
> > > > > > > > Finally, I noticed that the IndexReader was locking some of the files in
> > > > > > > > the index folder even though the reader was set to nolock mode. This
> > > > > > > > seemed to be the cause of the increase in the number of files in the
> > > > > > > > index folder.
> > > > > > > >
> > > > > > > > We have the IndexReader set to open once and then be shared among every
> > > > > > > > request to the web application. My understanding is that this is the
> > > > > > > > correct way to do this, and this never caused any issues when we were
> > > > > > > > optimizing the index after every update.
> > > > > > > >
> > > > > > > > I know this is a pretty vague problem and there could be any number of
> > > > > > > > issues involved here. However, if anyone could suggest areas to look
> > > > > > > > into for possible solutions, it would be greatly appreciated.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jeff
> > > > > > >
> > > > > > > --
> > > > > > > Simone Chiaretta
> > > > > > > Microsoft MVP ASP.NET - ASPInsider
> > > > > > > Blog: http://codeclimber.net.nz
> > > > > > > RSS: http://feeds2.feedburner.com/codeclimber
> > > > > > > twitter: @simonech
> > > > > > >
> > > > > > > Any sufficiently advanced technology is indistinguishable from magic
> > > > > > > "Life is short, play hard"

--
Simone Chiaretta
Microsoft MVP ASP.NET - ASPInsider
Blog: http://codeclimber.net.nz
RSS: http://feeds2.feedburner.com/codeclimber
twitter: @simonech

Any sufficiently advanced technology is indistinguishable from magic
"Life is short, play hard"
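
As a rough check of the filter memory estimate quoted in this thread (one bit per document per filter), here is a minimal sketch in Python. It only reproduces the arithmetic. The function names and the overhead_per_filter parameter are made up for illustration and are not part of Lucene or Lucene.NET, and the per-object overhead is left at zero because the thread gives no figure for it.

```python
def filter_bitset_bytes(num_docs: int) -> int:
    """Bytes needed to hold one bit per document, rounded up to whole bytes."""
    return (num_docs + 7) // 8


def total_filter_bytes(num_docs: int, num_filters: int,
                       overhead_per_filter: int = 0) -> int:
    """Total bytes for num_filters cached filters over the same index.

    overhead_per_filter is a hypothetical allowance for the filter object
    itself; the thread mentions this overhead but does not quantify it.
    """
    return num_filters * (filter_bitset_bytes(num_docs) + overhead_per_filter)


if __name__ == "__main__":
    docs = 50_000      # largest installation mentioned in the thread
    filters = 1_500    # one cached filter per blog
    print(filter_bitset_bytes(docs), "bytes per filter")
    print(total_filter_bytes(docs, filters), "bytes for all filters")
```

With these inputs the bit sets alone come to 6250 bytes per filter and 9,375,000 bytes for 1500 filters, which is in the same range as the approximate 6.5KB and 10MB figures once per-object overhead is added.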
