Re: Indexing slower in trunk

Simon Willnauer Thu, 16 Jun 2011 06:51:02 -0700

On Tue, Jun 14, 2011 at 2:54 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> Thanks, guys. Yes, I am running it all locally and disk seeks
> may well be the culprit. This thread is mainly to be sure that
> the behavior I'm seeing is expected, or at least explainable.
>
> Really, I don't need to pursue this further unless there's
> actually data I can gather to help speed things up. If this
> is just a consequence of DWPT and/or my particular
> setup then that's fine. I'm mostly trying to understand
> the characteristics of indexing/searching on the trunk.
> This started with me exploring memory
> requirements, and is really just something I noticed along
> the way and wanted to get some feedback on.
>
> So, absent the commit step, the times are reasonably
> comparable. Can I impose upon one of you to give a
> two-sentence summary of what DWPT buys us from a
> user perspective? If memory serves it should have
> background merging and other goodies.


Let me try to conclude this in a couple of sentences:

Previously IW wrote small in memory segments on a per-thread basis and
merged then together on flush. Yet, this means you can't add / update
any documents while we are flushing and flushing can take a long time.
With DWPT  we write segments single-threaded so each thread gets its
private DWPT. That allows to flush segments to disc concurrently while
carry on indexing at the same time. The performance gains are massive
here, our nightly benchmark sees 269% speedup on indexing throughput.
The downside is that you have to do more merges eventually since you
write more smallish segments (no in memory merge on flush). Plus if
you read from the same disk you are writing too you might see slower
indexing.

to read more about this I wrote a blog that explains a big portion of
it: http://blog.jteam.nl/2011/04/01/gimme-all-resources-you-have-i-can-use-them/

>
> Uwe:
> Yep, I was curious about optimize but understand that it's not required
> in recent code. That said, data is not searchable until a commit
> happens, so just for yucks I changed the optimize to a commit. Stats
> of that run below.
>
> Simon:
> OK, adjusted the ram buffer size to 512M, and it's a bit faster, but
> not all that much, see stats, and the delta could well be sampling
> errors, one run doth not a statistical certainty make. Up until the
> commit step, the admin stats page is showing no documents in
> the index so I think this setting completely avoids intermediate
> committing although that says nothing about the individual writers
> writing lots of segments to disk, that still happens.
>
> Added 1880000 docs. Took 1437 ms. cumulative interval (seconds) = 284
> Added 1890000 docs. Took 1285 ms. cumulative interval (seconds) = 285
> Added 1900000 docs. Took 1182 ms. cumulative interval (seconds) = 286
> Added 1910000 docs. Took 1675 ms. cumulative interval (seconds) = 288
> About to commit, total time so far: 290
> Total Time Taken-> 395 seconds    ***100 secs for the commit to finish.
> Total documents added-> 1917728
> Docs/sec-> 4855
>
> Thanks, all
> Erick
>
>
> On Tue, Jun 14, 2011 at 4:39 AM, Simon Willnauer
> <simon.willna...@googlemail.com> wrote:
>> Erick, it seems you need to adjust your settings for 4.0 a little.
>> When you index with DWPT it builds thread private segments which are
>> independently flushed to disk. Yet, when you set your ram buffer IW
>> will accumulate the ram used by all active DWPT and flush the largest
>> once you reach your ram buffer. with 128M you might end up wil lots of
>> small segments which need to be merged in the background. Eventually
>> what will happen here is that your disk is so busy that you are not
>> able to flush fast enough and threads might stall.
>>
>> What you can try here is adjust your RAM buffer to be a little higher,
>> lets say 350MB or change the max number of thread states in
>> DocumentsWriterPerThreadPool ie.
>> ThreadAffinityDocumentsWriterThreadPool. The latter is unfortunately
>> not exposed yet in solr so maybe for testing you just want to change
>> the default value in DocumentsWriterPerThreadPool to 4. That will also
>> cause segments to be bigger eventually.
>>
>> simon
>>
>> On Tue, Jun 14, 2011 at 10:28 AM, Uwe Schindler <u...@thetaphi.de> wrote:
>>> Hi Erick,
>>>
>>> Do you use harddisks or SSDs? I assume harddisks, which may explain what you
>>> see:
>>>
>>> - DWPT writes lots of segments in parallel, which also explains why you are
>>> seeing more files. Writing in parallel to several files, needs more head
>>> movements of your harddisk and this slows down. In the past, only one
>>> segment was written at the same time (sequential), so the harddisk is not so
>>> stressed.
>>> - Optimizing may be slower for the same reason: there are many more files to
>>> merge (but optimize cost should not be counted as a problem here as normally
>>> you won't need to optimize after initial indexing and optimizing was only a
>>> good idea pre Lucene-2.9, now it's mostly obsolete)
>>>
>>> Uwe
>>>
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: u...@thetaphi.de
>>>
>>>
>>>> -----Original Message-----
>>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>>>> Sent: Tuesday, June 14, 2011 2:46 AM
>>>> To: dev@lucene.apache.org; simon.willna...@gmail.com
>>>> Subject: Re: Indexing slower in trunk
>>>>
>>>> Simon:
>>>>
>>>> Yep, I was asking to see if it was weird. Pursuant to our
>>>> chat I tried some things, results below:
>>>>
>>>> All these are running on my local machine, same disk, same
>>>> JVM settings re: memory. The SolrJ indexing is happening
>>>> from IntelliJ (OK, I'm lazy).
>>>>
>>>> Rambuffer is set at 128M in all cases. Merge factor is 10.
>>>>
>>>> I'm allocation 2G to the server and 2G to the indexer
>>>>
>>>> Servers get started like this:
>>>> _server = new StreamingUpdateSolrServer(url, 10, 4);
>>>>
>>>> I choose the threads and queue length semi-arbitrarily.
>>>>
>>>> Autocommit originally was 10,000 docs, 60,000 maxTime for
>>>> all tests, but I removed that in the case of trunk , substituting
>>>> a "commitWithin" in the SolrJ code of 10 minutes. That flattened
>>>> out the run up until after the last doc was added, when I tried
>>>> to do an optimize and it took 270 seconds on trunk and 19
>>>> seconds on 3x. I do see a commit happening in the log on trunk
>>>> when I issue this, which isn't surprising.
>>>>
>>>> I'm a little surprised that I also see a bunch of  segments forming
>>>> during indexing on trunk, even though I don't issue any commits and
>>>> I doubt that I was running into ram buffer problems. By the time I'd
>>> indexed
>>>> 400K docs, there were 206 files in the index, by 1000K docs
>>>> there were 416. But I vaguely remember that there's some
>>>> background merging going on with DWPT, which would
>>>> explain the fact that the number of fies in the directory
>>>> keeps bouncing up and down (although generally
>>>> trends up till flattening out in the low-mid 400s). Don't know
>>>> whether it means anything or not, but the number of files
>>>> goes down when there are long pauses in the "interval"
>>>> reported to the console.
>>>>
>>>> Note below that the trunk seems slower even without
>>>> committing. But the optimize shoots through the roof.
>>>>
>>>> I've attached a file containing the full console output from
>>>> both runs, and the indexing code for your perusal. I can
>>>> easily try different parameters if you have some you
>>>> want me to run. I can instrument the code, compile and run
>>>> it if that'd help, just let me know where.
>>>>
>>>> I suspect that I've done something silly, but I have no clue
>>>> what as yet.
>>>>
>>>>
>>>> Erick
>>>>
>>>> Here's the end of the console output for the 3x case:
>>>> End of the run from my Idea console window (Took is in ms):
>>>>
>>>> Added 1890000 docs. Took 1361 ms. cumulative interval (seconds) = 248
>>>> Added 1900000 docs. Took 719 ms. cumulative interval (seconds) = 248
>>>> Added 1910000 docs. Took 1305 ms. cumulative interval (seconds) = 250
>>>> Jun 13, 2011 8:01:08 PM
>>>> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
>>>> INFO: finished:
>>>> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@60487
>>>> c5f
>>>> Jun 13, 2011 8:01:08 PM
>>>> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
>>>> INFO: finished:
>>>> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@132f4
>>>> 538
>>>> Jun 13, 2011 8:01:09 PM
>>>> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
>>>> INFO: finished:
>>>> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@46969
>>>> 5f
>>>> Jun 13, 2011 8:01:09 PM
>>>> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
>>>> INFO: finished:
>>>> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@58f41
>>>> 393
>>>> Total Time Taken-> 252 seconds
>>>> Total documents added-> 1917728
>>>> Docs/sec-> 7610
>>>> starting optimize
>>>> optimizing took 19 seconds
>>>>
>>>> Here's the end from trunk:
>>>> Added 1880000 docs. Took 1513 ms. cumulative interval (seconds) = 304
>>>> Added 1890000 docs. Took 1518 ms. cumulative interval (seconds) = 306
>>>> Added 1900000 docs. Took 1703 ms. cumulative interval (seconds) = 307
>>>> Added 1910000 docs. Took 1161 ms. cumulative interval (seconds) = 309
>>>> About to commit, total time so far: 309
>>>> Total Time Taken-> 309 seconds
>>>> Total documents added-> 1917728
>>>> Docs/sec-> 6206
>>>> starting optimize
>>>> optimizing took 270 seconds
>>>>
>>>> On Mon, Jun 13, 2011 at 4:50 PM, Simon Willnauer
>>>> <simon.willna...@googlemail.com> wrote:
>>>> > On Mon, Jun 13, 2011 at 8:13 PM, Erick Erickson
>>> <erickerick...@gmail.com>
>>>> wrote:
>>>> >> I half remember that this has come up before, but I couldn't find the
>>>> >> thread. I was running some tests over the weekend that involved
>>>> >> indexing 1.9M documents from the English Wiki dump.
>>>> >>
>>>> >> I'm consistently seeing that trunk takes about twice as long to index
>>>> >> the docs as 1.4, 3.2 and 3x. Optimize is also taking quite a bit
>>>> >> longer I admit that these aren't very sophisticated tests, and I only
>>>> >> ran the trunk process twice (although both those were consistent).
>>>> >>
>>>> >> I'm pretty sure my rambuffersize and autocommit settings are
>>>> >> identical. I remove the data/index directory before each run. These
>>>> >> results are running the indexing program in IntelliJ, on my Mac, both
>>>> >> the server and the indexing programs were running locally.
>>>> >>
>>>> >> No, trunk isn't compiling before running <G>.
>>>> >>
>>>> >> Here's the server definition:
>>>> >> new StreamingUpdateSolrServer(url, 10, 4);
>>>> >>
>>>> >> and I'm batching up the documents and sending them to Solr in batches
>>> of
>>>> 1,000.
>>>> >>
>>>> >> So, my question is whether this should be pursued. Note that I'm still
>>>> >> getting around 3K docs/second, which I can't complain about. Not that
>>>> >> that stops me, you understand. And in return for a memory footprint
>>>> >> reduction from 389M to 90M after some off-the-wall sorting and
>>>> >> faceting I'll take it!
>>>> >>
>>>> >> Hmmmm, speaking of which, the memory usage changes seem like a
>>>> good
>>>> >> candidate for a page on the Wiki, anyone want to suggest a home?
>>>> >>
>>>> >>
>>>> >> Solr 1.4.1
>>>> >> Total Time Taken-> 257 seconds
>>>> >> Total documents added-> 1917728
>>>> >> Docs/sec-> 7461
>>>> >> starting optimize
>>>> >> optimizing took 26 seconds
>>>> >>
>>>> >> Solr 3.2
>>>> >> Total Time Taken-> 243 seconds
>>>> >> Total documents added-> 1917728
>>>> >> Docs/sec-> 7891
>>>> >> starting optimize
>>>> >> optimizing took 21 seconds
>>>> >>
>>>> >> Solr 3x
>>>> >> Total Time Taken-> 269 seconds
>>>> >> Total documents added-> 1917728
>>>> >> Docs/sec-> 7129
>>>> >> starting optimize
>>>> >> optimizing took 21 seconds
>>>> >>
>>>> >> Solr trunk. 2011-6-11: 17:24 EST
>>>> >> Total Time Taken-> 592 seconds
>>>> >> Total documents added-> 1917728
>>>> >> Docs/sec-> 3239
>>>> >> starting optimize
>>>> >> optimizing took 159 seconds
>>>> >>
>>>> >> What do folks think? Is there anything I can/should do to narrow this
>>>> down?
>>>> >
>>>> > Hi Eric,
>>>> >
>>>> > this looks weird, I have some questions:
>>>> >
>>>> > - you are indexing into the same disk as you read the data from?
>>>> > - what are you rambuffer settings?
>>>> > - how many threads are you using to send data to solr?
>>>> > - what is your autocommit setting?
>>>> >
>>>> > simon
>>>> >
>>>> >
>>>> >>
>>>> >> Erick
>>>> >>
>>>> >> ---------------------------------------------------------------------
>>>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>> >>
>>>> >>
>>>> >
>>>> > ---------------------------------------------------------------------
>>>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>>>> >
>>>> >
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: Indexing slower in trunk

Reply via email to