On Fri, Jul 20, 2012 at 2:43 PM, Simon McDuff <smcd...@hotmail.com> wrote:
>
> Hi Simon W.,
> See comments below.
>
>> Date: Fri, 20 Jul 2012 11:49:03 +0200
>> Subject: Re: Flushing Thread
>> From: simon.willna...@gmail.com
>> To: java-user@lucene.apache.org
>>
>> hey simon ;)
>>
>> On Fri, Jul 20, 2012 at 2:29 AM, Simon McDuff <smcd...@hotmail.com> wrote:
>> >
>> > Thank you Simon Willnauer!
>> >
>> > With your explanation, we've decided to control the flushing by spawning another thread, so the indexing thread is still available to ingest! :-) (correct me if I'm wrong) We do so by checking the RAM size reported by Lucene (thank you!). By setting the automatic flush limit to 1000 MB and our own trigger to 900 MB, we know that the automatic flush "should" not happen.
>>
>> it should not. Yet, 1 GB is a large RAM buffer. In my tests I got much better results with lowish RAM buffers like 256 MB, since that causes flushes to happen more often and saturates the IO on the machine. The general goal is to keep the RAM buffer at a level where you flush almost constantly, i.e. you size the RAM buffer so that the next flush is due right as the previous flush finishes. Does that make sense?
>
> [SIMON M.] It makes sense for some use cases. In our case we have Fusion-io cards that write at 6 Gb/s, so we have no IO contention. Also, we use a larger RAM buffer to compress as much as possible (we have a lot of compression); in fact we found that 500 MB was enough.
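(For illustration, a minimal sketch of the external flush control described above: a background thread watches the writer's RAM usage and commits once a soft threshold is crossed, while the hard auto-flush limit stays higher. It assumes IndexWriter.ramSizeInBytes() and commit() as in the 3.x/4.0 API; the class name, thresholds and polling interval are placeholders, not the actual implementation discussed in this thread.)

import java.io.IOException;

import org.apache.lucene.index.IndexWriter;

// Watches the writer's RAM usage from a separate thread and triggers a flush
// (via commit) once a soft threshold is crossed, so the hard auto-flush limit
// configured with IndexWriterConfig.setRAMBufferSizeMB() is (almost) never hit
// by the indexing thread itself.
public class FlushMonitor implements Runnable {

  private final IndexWriter writer;
  private final long softLimitBytes;   // e.g. 900 MB when the auto-flush limit is 1000 MB
  private volatile boolean stopped;

  public FlushMonitor(IndexWriter writer, long softLimitBytes) {
    this.writer = writer;
    this.softLimitBytes = softLimitBytes;
  }

  public void stop() {
    stopped = true;
  }

  @Override
  public void run() {
    try {
      while (!stopped) {
        if (writer.ramSizeInBytes() >= softLimitBytes) {
          writer.commit();   // flushes (and fsyncs) the in-memory buffer on this thread
        }
        Thread.sleep(100);   // polling interval, arbitrary for the example
      }
    } catch (IOException e) {
      throw new RuntimeException(e);   // placeholder error handling
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}

(Something like new Thread(new FlushMonitor(writer, 900L * 1024 * 1024)).start() would start it; note that commit() also fsyncs, so if durability is not needed on every flush, a cheaper trigger such as opening an NRT reader could be used instead.)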
cool man I am jealous! :) I have to admit that the average usecase is not on Fusion IO :) I think for commodity hardware this makes a lot of sense though.

>> > I know you contribute a lot to the concurrency feature! This is great! I was very excited to try it!
>> > We tried the following approaches:
>> > Option 1 - 6 threads referring to the same IndexWriter
>> > Option 2 - 6 threads having their own IndexWriter, merged at the end
>> > Unfortunately, we found that option 2 scales better. I'm not sure why option 1 didn't scale. Is it possible that synchronization between threads is too costly? ... I don't have an answer, but it was definitely slower.
>>
>> can you provide the numbers and what you actually did in your experiment.
>
> [SIMON M.] I'm not at work today; I can provide these numbers Monday if you are still interested.

cool, I am totally interested. I will be on vacation until early August so I might reply late.

>> > With option 2, we are able to insert between 800 000 and 900 000 documents/sec (we've modified Lucene to remove some bottlenecks). The threads DO NOT ONLY index; they do other work before adding documents.
>>
>> what are your modifications? 800k documents/sec is a lot! I wonder what you are indexing, do you have any text you are inverting? I have run tests on a very strong machine with a 4k average doc size and I couldn't even get 10% of this. So in your case lock contention in the IndexWriter (there are still blocking parts) could be dominating. This is certainly not what we optimize for. I'd say in 99% of the cases most of the time is spent in DocumentsWriterPerThread inverting the document. If that is not the case in your experiment and you are only measuring thread overhead, then I can totally buy your numbers.
>
> [SIMON M.] We have 3 fields (2 fixed BytesRef and one bigger text field). 800k is for the 6 threads all together, so one thread is about 133 333 docs/sec. To achieve that performance we:
> - Removed the notification process in Lucene that checks for stalled flushes... it was really slow.
> - Fixed some places where memory wasn't recycled properly.
> - Removed the stored-fields writer... we do not use stored fields.
> - One IndexWriter per thread.
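(For illustration, the "one IndexWriter per thread, merged at the end" setup could look roughly like the sketch below. It assumes the 4.0 API (Version.LUCENE_40, FSDirectory.open(File), IndexWriter.addIndexes(Directory...)); the directory names, thread count, Worker class and buildDocuments() helper are placeholders, and none of the custom modifications listed above are reflected here.)

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PerThreadIndexing {

  // Each worker owns its own writer and directory, so there is no shared
  // IndexWriter to contend on while indexing.
  static class Worker implements Runnable {
    final Directory dir;
    final IndexWriter writer;

    Worker(File path) throws IOException {
      dir = FSDirectory.open(path);
      writer = new IndexWriter(dir, new IndexWriterConfig(
          Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)));
    }

    @Override
    public void run() {
      try {
        for (Document doc : buildDocuments()) {   // placeholder document source
          writer.addDocument(doc);
        }
      } catch (IOException e) {
        throw new RuntimeException(e);            // placeholder error handling
      }
    }
  }

  public static void main(String[] args) throws Exception {
    final int numThreads = 6;
    List<Worker> workers = new ArrayList<Worker>();
    List<Thread> threads = new ArrayList<Thread>();
    for (int i = 0; i < numThreads; i++) {
      Worker w = new Worker(new File("index-part-" + i));
      workers.add(w);
      Thread t = new Thread(w);
      threads.add(t);
      t.start();
    }
    for (Thread t : threads) {
      t.join();
    }

    // Merge the per-thread indexes into the final index.
    IndexWriter finalWriter = new IndexWriter(
        FSDirectory.open(new File("index-final")),
        new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)));
    Directory[] parts = new Directory[numThreads];
    for (int i = 0; i < numThreads; i++) {
      workers.get(i).writer.close();   // flush and close before merging
      parts[i] = workers.get(i).dir;
    }
    finalWriter.addIndexes(parts);
    finalWriter.close();
  }

  // Stand-in for whatever produces the three-field documents mentioned above.
  static Iterable<Document> buildDocuments() {
    return new ArrayList<Document>();
  }
}

(addIndexes(Directory...) copies the incoming segments into the target index without rewriting them; any further merging is left to the merge policy.)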
that is very interesting - I changed the stalling stuff just before 4.0 alpha from non-blocking to blocking, so I might need to think about a different way of doing that. Can you please report the places where you saw memory problems? I have some ideas about refactoring the IW to make usecases like yours easier; here is my brain dump on IRC, just for the record:

[6:46pm] s1monw: 1. I wanna logically divide writing and merging
[6:46pm] s1monw: so all merge code should go in a dedicated API
[6:46pm] s1monw: 2. IndexWriter should be a composite out of IndexWriters
[6:47pm] s1monw: 2a. the composite would take care of notifying the merger and handle deletes
[6:47pm] s1monw: and do the commit
[6:47pm] s1monw: the other IWs are single threaded
[6:48pm] s1monw: no synchronization
[6:48pm] mikemccand: sounds awesome!
[6:48pm] s1monw: ie like a DocumentsWriterPerThread, just public
[6:48pm] s1monw: that way you can build something like simonm wants easily
[6:48pm] mikemccand: so each app thread opens a dedicated writer?
[6:48pm] s1monw: well you can have multiple models
[6:49pm] s1monw: you can have a non-blocking IW where you just hand off docs
[6:49pm] s1monw: and get a callback once done
[6:49pm] s1monw: or you have what we have today
[6:49pm] s1monw: like blocking
[6:49pm] s1monw: but you can also, if you don't use updateDocument, have an IW per thread
[6:49pm] s1monw: like you just said
[6:49pm] s1monw: its all up to you
[6:49pm] mikemccand: ok
[6:49pm] s1monw: makes sense
[6:49pm] s1monw: ?
[6:49pm] mikemccand: yes!
[6:50pm] s1monw: ok cool
[6:50pm] mikemccand: handling pending deletes seems tricky...
[6:50pm] s1monw: did I say its easy
[6:50pm] s1monw: :D
[6:50pm] mikemccand: LOL
[6:50pm] s1monw: its basically all up to the wrapper
[6:50pm] s1monw: so lets say we have a ClassicalIW
[6:51pm] s1monw: that has N IndexWriterPerThread
[6:51pm] s1monw: IndexWriterPerThread allows to flush in a single thread and hands back the deletes it has
[6:52pm] mikemccand: good
[6:52pm] s1monw: the ClassicIW handles deletes on a global level
[6:52pm] s1monw: and it handles the deletes per IWPerThread
[6:53pm] s1monw: so IWPerThread doesn't know about deletes until flush
[6:53pm] s1monw: on flush we pass in a BitSet that marks deleted docs
[6:53pm] s1monw: ie updates on another thread
[6:53pm] s1monw: ie docId + seq id
[6:53pm] s1monw: or something like that
[6:53pm] s1monw: err. Term + seq id
[6:53pm] s1monw: and apply everything on flush

simon

>> > Did you look at the disruptor pattern (by LMAX)? It helped us a lot to achieve great performance in a multithreaded environment!
>>
>> I know of the pattern, though their usecase is totally different from ours. The time spent per transaction is super low compared to the thread overhead, so they optimize for high-performance computing, i.e. for something like 5M transactions per second you enter/leave locks literally all the time. With IndexWriter you don't have such a pattern. Large numbers would be more like 50k/sec, two orders of magnitude less, so lock overhead becomes minor since contention is much lower. If you go and make your documents super small, like not inverting anything or only storing, you might see an overhead in the threading model, I agree. Our bottleneck here is not lock contention but IO, and that is what we optimized for. Makes sense?
>
> [SIMON M.] Not really. By adding documents without any stored fields, everything stays in memory until we flush, so the bottleneck wasn't IO.
> When adding documents, that is where we optimized.
> I just mentioned the disruptor because I thought a design with an IndexWriter that has a ring buffer inside, and many threads that write or flush, would be faster. In fact this is what we did, but externally, by having one IndexWriter per thread (6 IndexWriters). By doing it internally I think we could remove a lot of overhead. The advantage is that your producer should never block. :-) The drawback is that you need to copy these fields to the ring buffer. I do understand it is not suitable for everybody.
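(For illustration, the "hand off documents so the producer never blocks" idea can be sketched with a plain BlockingQueue standing in for the LMAX ring buffer: the producer enqueues documents and only blocks when the queue is full, while a pool of worker threads makes the addDocument calls, so whichever worker hits the RAM limit pays for the flush. This is not the Disruptor and not what was actually built here; the class and names are placeholders, and the workers could just as well each own their writer as in option 2 above.)

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class QueueingIndexer {

  private static final Document POISON = new Document();   // shutdown marker

  private final BlockingQueue<Document> queue;
  private final IndexWriter writer;
  private final Thread[] workers;

  public QueueingIndexer(final IndexWriter writer, int numWorkers, int capacity) {
    this.writer = writer;
    this.queue = new ArrayBlockingQueue<Document>(capacity);
    this.workers = new Thread[numWorkers];
    for (int i = 0; i < numWorkers; i++) {
      workers[i] = new Thread(new Runnable() {
        @Override
        public void run() {
          try {
            for (Document doc = queue.take(); doc != POISON; doc = queue.take()) {
              writer.addDocument(doc);   // whichever worker hits the RAM limit pays for the flush
            }
            queue.put(POISON);           // pass the marker on so the other workers stop too
          } catch (Exception e) {
            throw new RuntimeException(e);   // placeholder error handling
          }
        }
      });
      workers[i].start();
    }
  }

  // The producer thread only blocks when the queue is full.
  public void add(Document doc) throws InterruptedException {
    queue.put(doc);
  }

  public void shutdown() throws InterruptedException {
    queue.put(POISON);
    for (Thread t : workers) {
      t.join();
    }
  }
}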
>> That said, if you really wanna optimize this you could write your own DocumentsWriterPerThreadPool and a custom FlushPolicy (both package private in org.apache.lucene.index). In the DWPThreadPool you only maintain one DWPT, and in the FlushPolicy you only track the RAM consumption of that DWPT. Once you see that it has filled up, you notify another thread that it's time to flush, and go out and call commit. You can then over time find out what the right RAM buffer is to saturate IO, avoid creating so many segments that performance dies due to too many background merges, and maximise in-memory throughput.
>
> [SIMON M.] Thank you for the tips. I will continue to find the bottlenecks we have!
>>
>> simonw :)
>>
>> > Thank you
>> > Simon M.
>> >
>> >> Date: Thu, 19 Jul 2012 21:52:19 +0200
>> >> Subject: Re: Flushing Thread
>> >> From: simon.willna...@gmail.com
>> >> To: java-user@lucene.apache.org
>> >>
>> >> hey,
>> >>
>> >> On Thu, Jul 19, 2012 at 7:41 PM, Simon McDuff <smcd...@hotmail.com> wrote:
>> >> >
>> >> > Thank you for your answer!
>> >> >
>> >> > I read all your blogs! They are always interesting!
>> >>
>> >> for details see:
>> >>
>> >> http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/
>> >>
>> >> and
>> >>
>> >> http://www.searchworkings.org/blog/-/blogs/lucene-indexing-gains-concurrency/
>> >>
>> >> > My understanding is probably incorrect ...
>> >> > I observed that if you have only one thread calling addDocument, it will not spawn another thread for flushing; it uses the main thread.
>> >>
>> >> every indexing thread can hit a flush. if you only have one thread you will not make progress adding docs while flushing. IW will not create new threads for flushing.
>> >>
>> >> > In this case, my main thread is blocked. Correct?
>> >> >
>> >> > The concurrent flushing will ONLY work when I have many threads adding documents? (In that case I will need to put a ringbuffer in front.)
>> >>
>> >> that is basically correct. You can frequently call commit, or pull a reader from the IW in a different thread before your RAM buffer fills up, so that flushing happens in a different thread. That could work pretty well if you don't have many deletes to be applied. (If you have many deletes, then pull a reader without applying deletes.)
>> >>
>> >> simon
>> >>
>> >> > Do I understand correctly? Did I miss something?
>> >> >
>> >> > Simon
>> >> >
>> >> >> From: luc...@mikemccandless.com
>> >> >> Date: Thu, 19 Jul 2012 13:02:42 -0400
>> >> >> Subject: Re: Flushing Thread
>> >> >> To: java-user@lucene.apache.org
>> >> >>
>> >> >> This has already been fixed in Lucene 4.0 (we now have fully concurrent flushing), e.g. see:
>> >> >>
>> >> >> http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html
>> >> >>
>> >> >> Mike McCandless
>> >> >>
>> >> >> http://blog.mikemccandless.com
>> >> >>
>> >> >> On Thu, Jul 19, 2012 at 12:54 PM, Simon McDuff <smcd...@hotmail.com> wrote:
>> >> >> >
>> >> >> > I see some behavior at the moment when I'm flushing and would like to know if I can change it.
>> >> >> >
>> >> >> > One main thread is inserting; when it flushes, it blocks. During that time my main thread is blocked. Instead of blocking, could it spawn another thread to do that?
>> >> >> >
>> >> >> > Basically, I would like to have one main thread adding documents to my index; if a flush needs to occur, spawn another thread, but it should never block the main thread. Is that possible?
>> >> >> >
>> >> >> > Is the only solution to have many threads indexing the data? In that case, is it true to say that ONLY one of them will be busy while the other is flushing? (I do understand that if my flushing takes too much time, they will both flush... :-))
>> >> >> >
>> >> >> > Thank you!
>> >> >> >
>> >> >> > Simon
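(For illustration, the "pull a reader from the IW in a different thread" tip from the quoted mail, assuming 4.0's DirectoryReader.open(IndexWriter, boolean): passing false skips applying pending deletes, so the reopen mainly just forces a flush. The class name and interval are placeholders.)

import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;

public class BackgroundFlusher implements Runnable {

  private final IndexWriter writer;
  private volatile boolean stopped;

  public BackgroundFlusher(IndexWriter writer) {
    this.writer = writer;
  }

  public void stop() {
    stopped = true;
  }

  @Override
  public void run() {
    try {
      while (!stopped) {
        // Opening an NRT reader forces the in-memory segments to be flushed;
        // 'false' means pending deletes are not applied, which keeps it cheap.
        DirectoryReader reader = DirectoryReader.open(writer, false);
        reader.close();
        Thread.sleep(1000);   // arbitrary interval for the example
      }
    } catch (IOException e) {
      throw new RuntimeException(e);   // placeholder error handling
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}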
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org