On Fri, Jul 20, 2012 at 2:43 PM, Simon McDuff <smcd...@hotmail.com> wrote:
>
> Hi Simon W.,
> See comments below.
>
>> Date: Fri, 20 Jul 2012 11:49:03 +0200
>> Subject: Re: Flushing Thread
>> From: simon.willna...@gmail.com
>> To: java-user@lucene.apache.org
>>
>> hey simon ;)
>>
>> On Fri, Jul 20, 2012 at 2:29 AM, Simon McDuff <smcd...@hotmail.com> wrote:
>> >
>> > Thank you Simon Willnauer!
>> >
>> > With your explanation, we've decided to control the flushing by spawning another thread, so the indexing thread is still available to ingest! :-) (correct me if I'm wrong) We do so by checking the RAM size reported by Lucene (thank you!). By setting the automatic flush limit to 1000 MB and our own trigger to 900 MB, we know that the automatic flush "should" not happen.
>>
>> it should not. Yet, 1 GB is a large RAM buffer. In my tests I got much better results with lowish RAM buffers like 256 MB, since that causes flushes to happen more often and saturates the IO on the machine. The general goal is to keep the RAM buffer at a level where you flush almost constantly, i.e. you size the RAM buffer so that the next flush is due right as the previous flush finishes. Does that make sense?
>
> [SIMON M.] It makes sense for some use cases. In our case we have Fusion-io cards that write at 6 Gb/s, so we have no IO contention. Also, we use a larger RAM buffer to compress as much as possible (we have a lot of compression); in fact we found that 500 MB was enough.
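(For illustration, a minimal sketch of the external flush control described above: a background thread watches the writer's RAM usage and commits once a soft threshold is crossed, while the hard auto-flush limit stays higher. It assumes IndexWriter.ramSizeInBytes() and commit() as in the 3.x/4.0 API; the class name, thresholds and polling interval are placeholders, not the actual implementation discussed in this thread.)

import java.io.IOException;

import org.apache.lucene.index.IndexWriter;

// Watches the writer's RAM usage from a separate thread and triggers a flush
// (via commit) once a soft threshold is crossed, so the hard auto-flush limit
// configured with IndexWriterConfig.setRAMBufferSizeMB() is (almost) never hit
// by the indexing thread itself.
public class FlushMonitor implements Runnable {

  private final IndexWriter writer;
  private final long softLimitBytes;   // e.g. 900 MB when the auto-flush limit is 1000 MB
  private volatile boolean stopped;

  public FlushMonitor(IndexWriter writer, long softLimitBytes) {
    this.writer = writer;
    this.softLimitBytes = softLimitBytes;
  }

  public void stop() {
    stopped = true;
  }

  @Override
  public void run() {
    try {
      while (!stopped) {
        if (writer.ramSizeInBytes() >= softLimitBytes) {
          writer.commit();   // flushes (and fsyncs) the in-memory buffer on this thread
        }
        Thread.sleep(100);   // polling interval, arbitrary for the example
      }
    } catch (IOException e) {
      throw new RuntimeException(e);   // placeholder error handling
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}

(Something like new Thread(new FlushMonitor(writer, 900L * 1024 * 1024)).start() would start it; note that commit() also fsyncs, so if durability is not needed on every flush, a cheaper trigger such as opening an NRT reader could be used instead.)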
cool man I am jealous! :) I have to admit that the average usecase is not on Fusion IO :) I think for commodity hardware this makes a lot of sense though.

>> > I know you contribute a lot to the concurrency feature! This is great! I was very excited to try it!
>> > We tried the following approaches:
>> > Option 1 - 6 threads referring to the same IndexWriter
>> > Option 2 - 6 threads having their own IndexWriter, merged at the end
>> > Unfortunately, we found that option 2 scales better. I'm not sure why option 1 didn't scale. Is it possible that synchronization between threads is too costly? ... I don't have an answer, but it was definitely slower.
>>
>> can you provide the numbers and what you actually did in your experiment.
>
> [SIMON M.] I'm not at work today; I can provide these numbers Monday if you are still interested.

cool, I am totally interested. I will be on vacation until early August so I might reply late.

>> > With option 2, we are able to insert between 800 000 and 900 000 documents/sec (we've modified Lucene to remove some bottlenecks). The threads DO NOT ONLY index; they do other work before adding documents.
>>
>> what are your modifications? 800k documents/sec is a lot! I wonder what you are indexing, do you have any text you are inverting? I have run tests on a very strong machine with a 4k average doc size and I couldn't even get 10% of this. So in your case lock contention in the IndexWriter (there are still blocking parts) could be dominating. This is certainly not what we optimize for. I'd say in 99% of the cases most of the time is spent in DocumentsWriterPerThread inverting the document. If that is not the case in your experiment and you are only measuring thread overhead, then I can totally buy your numbers.
>
> [SIMON M.] We have 3 fields (2 fixed BytesRef and one bigger text field). 800k is for the 6 threads all together, so one thread is about 133 333 docs/sec. To achieve that performance we:
> - Removed the notification process in Lucene that checks for stalled flushes... it was really slow.
> - Fixed some places where memory wasn't recycled properly.
> - Removed the stored-fields writer... we do not use stored fields.
> - One IndexWriter per thread.
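(For illustration, the "one IndexWriter per thread, merged at the end" setup could look roughly like the sketch below. It assumes the 4.0 API (Version.LUCENE_40, FSDirectory.open(File), IndexWriter.addIndexes(Directory...)); the directory names, thread count, Worker class and buildDocuments() helper are placeholders, and none of the custom modifications listed above are reflected here.)

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PerThreadIndexing {

  // Each worker owns its own writer and directory, so there is no shared
  // IndexWriter to contend on while indexing.
  static class Worker implements Runnable {
    final Directory dir;
    final IndexWriter writer;

    Worker(File path) throws IOException {
      dir = FSDirectory.open(path);
      writer = new IndexWriter(dir, new IndexWriterConfig(
          Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)));
    }

    @Override
    public void run() {
      try {
        for (Document doc : buildDocuments()) {   // placeholder document source
          writer.addDocument(doc);
        }
      } catch (IOException e) {
        throw new RuntimeException(e);            // placeholder error handling
      }
    }
  }

  public static void main(String[] args) throws Exception {
    final int numThreads = 6;
    List<Worker> workers = new ArrayList<Worker>();
    List<Thread> threads = new ArrayList<Thread>();
    for (int i = 0; i < numThreads; i++) {
      Worker w = new Worker(new File("index-part-" + i));
      workers.add(w);
      Thread t = new Thread(w);
      threads.add(t);
      t.start();
    }
    for (Thread t : threads) {
      t.join();
    }

    // Merge the per-thread indexes into the final index.
    IndexWriter finalWriter = new IndexWriter(
        FSDirectory.open(new File("index-final")),
        new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)));
    Directory[] parts = new Directory[numThreads];
    for (int i = 0; i < numThreads; i++) {
      workers.get(i).writer.close();   // flush and close before merging
      parts[i] = workers.get(i).dir;
    }
    finalWriter.addIndexes(parts);
    finalWriter.close();
  }

  // Stand-in for whatever produces the three-field documents mentioned above.
  static Iterable<Document> buildDocuments() {
    return new ArrayList<Document>();
  }
}

(addIndexes(Directory...) copies the incoming segments into the target index without rewriting them; any further merging is left to the merge policy.)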
that is very interesting - I changed the stalling stuff just before 4.0 alpha from non-blocking to blocking, so I might need to think about a different way of doing that. Can you please report the places where you saw memory problems? I have some ideas about refactoring the IW to make usecases like yours easier; here is my brain dump on IRC, just for the record:

[6:46pm] s1monw: 1. I wanna logically divide writing and merging
[6:46pm] s1monw: so all merge code should go in a dedicated API
[6:46pm] s1monw: 2. IndexWriter should be a composite out of IndexWriters
[6:47pm] s1monw: 2a. the composite would take care of notifying the merger and handle deletes
[6:47pm] s1monw: and do the commit
[6:47pm] s1monw: the other IWs are single threaded
[6:48pm] s1monw: no synchronization
[6:48pm] mikemccand: sounds awesome!
[6:48pm] s1monw: ie like a DocumentsWriterPerThread, just public
[6:48pm] s1monw: that way you can build something like simonm wants easily
[6:48pm] mikemccand: so each app thread opens a dedicated writer?
[6:48pm] s1monw: well you can have multiple models
[6:49pm] s1monw: you can have a non-blocking IW where you just hand off docs
[6:49pm] s1monw: and get a callback once done
[6:49pm] s1monw: or you have what we have today
[6:49pm] s1monw: like blocking
[6:49pm] s1monw: but you can also, if you don't use updateDocument, have an IW per thread
[6:49pm] s1monw: like you just said
[6:49pm] s1monw: its all up to you
[6:49pm] mikemccand: ok
[6:49pm] s1monw: makes sense
[6:49pm] s1monw: ?
[6:49pm] mikemccand: yes!
[6:50pm] s1monw: ok cool
[6:50pm] mikemccand: handling pending deletes seems tricky...
[6:50pm] s1monw: did I say its easy
[6:50pm] s1monw: :D
[6:50pm] mikemccand: LOL
[6:50pm] s1monw: its basically all up to the wrapper
[6:50pm] s1monw: so lets say we have a ClassicalIW
[6:51pm] s1monw: that has N IndexWriterPerThread
[6:51pm] s1monw: IndexWriterPerThread allows to flush in a single thread and hands back the deletes it has
[6:52pm] mikemccand: good
[6:52pm] s1monw: the ClassicIW handles deletes on a global level
[6:52pm] s1monw: and it handles the deletes per IWPerThread
[6:53pm] s1monw: so IWPerThread doesn't know about deletes until flush
[6:53pm] s1monw: on flush we pass in a BitSet that marks deleted docs
[6:53pm] s1monw: ie updates on another thread
[6:53pm] s1monw: ie docId + seq id
[6:53pm] s1monw: or something like that
[6:53pm] s1monw: err. Term + seq id
[6:53pm] s1monw: and apply everything on flush

simon

>> > Did you look at the disruptor pattern (by LMAX)? It helped us a lot to achieve great performance in a multithreaded environment!
>>
>> I know of the pattern, though their usecase is totally different from ours. The time spent per transaction is super low compared to the thread overhead, so they optimize for high-performance computing, i.e. for something like 5M transactions per second you enter/leave locks literally all the time. With IndexWriter you don't have such a pattern. Large numbers would be more like 50k/sec, two orders of magnitude less, so lock overhead becomes minor since contention is much lower. If you go and make your documents super small, like not inverting anything or only storing, you might see an overhead in the threading model, I agree. Our bottleneck here is not lock contention but IO, and that is what we optimized for. Makes sense?
>
> [SIMON M.] Not really. By adding documents without any stored fields, everything stays in memory until we flush, so the bottleneck wasn't IO.
> When adding documents, that is where we optimized.
> I just mentioned the disruptor because I thought a design with an IndexWriter that has a ring buffer inside, and many threads that write or flush, would be faster. In fact this is what we did, but externally, by having one IndexWriter per thread (6 IndexWriters). By doing it internally I think we could remove a lot of overhead. The advantage is that your producer should never block. :-) The drawback is that you need to copy these fields to the ring buffer. I do understand it is not suitable for everybody.
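(For illustration, the "hand off documents so the producer never blocks" idea can be sketched with a plain BlockingQueue standing in for the LMAX ring buffer: the producer enqueues documents and only blocks when the queue is full, while a pool of worker threads makes the addDocument calls, so whichever worker hits the RAM limit pays for the flush. This is not the Disruptor and not what was actually built here; the class and names are placeholders, and the workers could just as well each own their writer as in option 2 above.)

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class QueueingIndexer {

  private static final Document POISON = new Document();   // shutdown marker

  private final BlockingQueue<Document> queue;
  private final IndexWriter writer;
  private final Thread[] workers;

  public QueueingIndexer(final IndexWriter writer, int numWorkers, int capacity) {
    this.writer = writer;
    this.queue = new ArrayBlockingQueue<Document>(capacity);
    this.workers = new Thread[numWorkers];
    for (int i = 0; i < numWorkers; i++) {
      workers[i] = new Thread(new Runnable() {
        @Override
        public void run() {
          try {
            for (Document doc = queue.take(); doc != POISON; doc = queue.take()) {
              writer.addDocument(doc);   // whichever worker hits the RAM limit pays for the flush
            }
            queue.put(POISON);           // pass the marker on so the other workers stop too
          } catch (Exception e) {
            throw new RuntimeException(e);   // placeholder error handling
          }
        }
      });
      workers[i].start();
    }
  }

  // The producer thread only blocks when the queue is full.
  public void add(Document doc) throws InterruptedException {
    queue.put(doc);
  }

  public void shutdown() throws InterruptedException {
    queue.put(POISON);
    for (Thread t : workers) {
      t.join();
    }
  }
}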
>> That said, if you really wanna optimize this you could write your own DocumentsWriterPerThreadPool and a custom FlushPolicy (both package private in org.apache.lucene.index). In the DWPThreadPool you only maintain one DWPT, and in the FlushPolicy you only track the RAM consumption of that DWPT. Once you see that it has filled up, you notify another thread that it's time to flush, and go out and call commit. You can then over time find out what the right RAM buffer is to saturate IO, avoid creating so many segments that performance dies due to too many background merges, and maximise in-memory throughput.
>
> [SIMON M.] Thank you for the tips. I will continue to find the bottlenecks we have!
>>
>> simonw :)
>>
>> > Thank you
>> > Simon M.
>> >
>> >> Date: Thu, 19 Jul 2012 21:52:19 +0200
>> >> Subject: Re: Flushing Thread
>> >> From: simon.willna...@gmail.com
>> >> To: java-user@lucene.apache.org
>> >>
>> >> hey,
>> >>
>> >> On Thu, Jul 19, 2012 at 7:41 PM, Simon McDuff <smcd...@hotmail.com> wrote:
>> >> >
>> >> > Thank you for your answer!
>> >> >
>> >> > I read all your blogs! They are always interesting!
>> >>
>> >> for details see:
>> >>
>> >> http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/
>> >>
>> >> and
>> >>
>> >> http://www.searchworkings.org/blog/-/blogs/lucene-indexing-gains-concurrency/
>> >>
>> >> > My understanding is probably incorrect ...
>> >> > I observed that if you have only one thread calling addDocument, it will not spawn another thread for flushing; it uses the main thread.
>> >>
>> >> every indexing thread can hit a flush. if you only have one thread you will not make progress adding docs while flushing. IW will not create new threads for flushing.
>> >>
>> >> > In this case, my main thread is blocked. Correct?
>> >> >
>> >> > The concurrent flushing will ONLY work when I have many threads adding documents? (In that case I will need to put a ringbuffer in front.)
>> >>
>> >> that is basically correct. You can frequently call commit, or pull a reader from the IW in a different thread before your RAM buffer fills up, so that flushing happens in a different thread. That could work pretty well if you don't have many deletes to be applied. (If you have many deletes, then pull a reader without applying deletes.)
>> >>
>> >> simon
>> >>
>> >> > Do I understand correctly? Did I miss something?
>> >> >
>> >> > Simon
>> >> >
>> >> >> From: luc...@mikemccandless.com
>> >> >> Date: Thu, 19 Jul 2012 13:02:42 -0400
>> >> >> Subject: Re: Flushing Thread
>> >> >> To: java-user@lucene.apache.org
>> >> >>
>> >> >> This has already been fixed in Lucene 4.0 (we now have fully concurrent flushing), e.g. see:
>> >> >>
>> >> >> http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html
>> >> >>
>> >> >> Mike McCandless
>> >> >>
>> >> >> http://blog.mikemccandless.com
>> >> >>
>> >> >> On Thu, Jul 19, 2012 at 12:54 PM, Simon McDuff <smcd...@hotmail.com> wrote:
>> >> >> >
>> >> >> > I see some behavior at the moment when I'm flushing and would like to know if I can change it.
>> >> >> >
>> >> >> > One main thread is inserting; when it flushes, it blocks. During that time my main thread is blocked. Instead of blocking, could it spawn another thread to do that?
>> >> >> >
>> >> >> > Basically, I would like to have one main thread adding documents to my index; if a flush needs to occur, spawn another thread, but it should never block the main thread. Is that possible?
>> >> >> >
>> >> >> > Is the only solution to have many threads indexing the data? In that case, is it true to say that ONLY one of them will be busy while the other is flushing? (I do understand that if my flushing takes too much time, they will both flush... :-))
>> >> >> >
>> >> >> > Thank you!
>> >> >> >
>> >> >> > Simon
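(For illustration, the "pull a reader from the IW in a different thread" tip from the quoted mail, assuming 4.0's DirectoryReader.open(IndexWriter, boolean): passing false skips applying pending deletes, so the reopen mainly just forces a flush. The class name and interval are placeholders.)

import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;

public class BackgroundFlusher implements Runnable {

  private final IndexWriter writer;
  private volatile boolean stopped;

  public BackgroundFlusher(IndexWriter writer) {
    this.writer = writer;
  }

  public void stop() {
    stopped = true;
  }

  @Override
  public void run() {
    try {
      while (!stopped) {
        // Opening an NRT reader forces the in-memory segments to be flushed;
        // 'false' means pending deletes are not applied, which keeps it cheap.
        DirectoryReader reader = DirectoryReader.open(writer, false);
        reader.close();
        Thread.sleep(1000);   // arbitrary interval for the example
      }
    } catch (IOException e) {
      throw new RuntimeException(e);   // placeholder error handling
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}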
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org