Hi Erick,

Do you use harddisks or SSDs? I assume harddisks, which may explain what you
see:

- DWPT writes lots of segments in parallel, which also explains why you are
seeing more files. Writing in parallel to several files, needs more head
movements of your harddisk and this slows down. In the past, only one
segment was written at the same time (sequential), so the harddisk is not so
stressed.
- Optimizing may be slower for the same reason: there are many more files to
merge (but optimize cost should not be counted as a problem here as normally
you won't need to optimize after initial indexing and optimizing was only a
good idea pre Lucene-2.9, now it's mostly obsolete)

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Tuesday, June 14, 2011 2:46 AM
> To: dev@lucene.apache.org; simon.willna...@gmail.com
> Subject: Re: Indexing slower in trunk
> 
> Simon:
> 
> Yep, I was asking to see if it was weird. Pursuant to our
> chat I tried some things, results below:
> 
> All these are running on my local machine, same disk, same
> JVM settings re: memory. The SolrJ indexing is happening
> from IntelliJ (OK, I'm lazy).
> 
> Rambuffer is set at 128M in all cases. Merge factor is 10.
> 
> I'm allocation 2G to the server and 2G to the indexer
> 
> Servers get started like this:
> _server = new StreamingUpdateSolrServer(url, 10, 4);
> 
> I choose the threads and queue length semi-arbitrarily.
> 
> Autocommit originally was 10,000 docs, 60,000 maxTime for
> all tests, but I removed that in the case of trunk , substituting
> a "commitWithin" in the SolrJ code of 10 minutes. That flattened
> out the run up until after the last doc was added, when I tried
> to do an optimize and it took 270 seconds on trunk and 19
> seconds on 3x. I do see a commit happening in the log on trunk
> when I issue this, which isn't surprising.
> 
> I'm a little surprised that I also see a bunch of  segments forming
> during indexing on trunk, even though I don't issue any commits and
> I doubt that I was running into ram buffer problems. By the time I'd
indexed
> 400K docs, there were 206 files in the index, by 1000K docs
> there were 416. But I vaguely remember that there's some
> background merging going on with DWPT, which would
> explain the fact that the number of fies in the directory
> keeps bouncing up and down (although generally
> trends up till flattening out in the low-mid 400s). Don't know
> whether it means anything or not, but the number of files
> goes down when there are long pauses in the "interval"
> reported to the console.
> 
> Note below that the trunk seems slower even without
> committing. But the optimize shoots through the roof.
> 
> I've attached a file containing the full console output from
> both runs, and the indexing code for your perusal. I can
> easily try different parameters if you have some you
> want me to run. I can instrument the code, compile and run
> it if that'd help, just let me know where.
> 
> I suspect that I've done something silly, but I have no clue
> what as yet.
> 
> 
> Erick
> 
> Here's the end of the console output for the 3x case:
> End of the run from my Idea console window (Took is in ms):
> 
> Added 1890000 docs. Took 1361 ms. cumulative interval (seconds) = 248
> Added 1900000 docs. Took 719 ms. cumulative interval (seconds) = 248
> Added 1910000 docs. Took 1305 ms. cumulative interval (seconds) = 250
> Jun 13, 2011 8:01:08 PM
> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
> INFO: finished:
> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@60487
> c5f
> Jun 13, 2011 8:01:08 PM
> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
> INFO: finished:
> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@132f4
> 538
> Jun 13, 2011 8:01:09 PM
> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
> INFO: finished:
> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@46969
> 5f
> Jun 13, 2011 8:01:09 PM
> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner run
> INFO: finished:
> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@58f41
> 393
> Total Time Taken-> 252 seconds
> Total documents added-> 1917728
> Docs/sec-> 7610
> starting optimize
> optimizing took 19 seconds
> 
> Here's the end from trunk:
> Added 1880000 docs. Took 1513 ms. cumulative interval (seconds) = 304
> Added 1890000 docs. Took 1518 ms. cumulative interval (seconds) = 306
> Added 1900000 docs. Took 1703 ms. cumulative interval (seconds) = 307
> Added 1910000 docs. Took 1161 ms. cumulative interval (seconds) = 309
> About to commit, total time so far: 309
> Total Time Taken-> 309 seconds
> Total documents added-> 1917728
> Docs/sec-> 6206
> starting optimize
> optimizing took 270 seconds
> 
> On Mon, Jun 13, 2011 at 4:50 PM, Simon Willnauer
> <simon.willna...@googlemail.com> wrote:
> > On Mon, Jun 13, 2011 at 8:13 PM, Erick Erickson
<erickerick...@gmail.com>
> wrote:
> >> I half remember that this has come up before, but I couldn't find the
> >> thread. I was running some tests over the weekend that involved
> >> indexing 1.9M documents from the English Wiki dump.
> >>
> >> I'm consistently seeing that trunk takes about twice as long to index
> >> the docs as 1.4, 3.2 and 3x. Optimize is also taking quite a bit
> >> longer I admit that these aren't very sophisticated tests, and I only
> >> ran the trunk process twice (although both those were consistent).
> >>
> >> I'm pretty sure my rambuffersize and autocommit settings are
> >> identical. I remove the data/index directory before each run. These
> >> results are running the indexing program in IntelliJ, on my Mac, both
> >> the server and the indexing programs were running locally.
> >>
> >> No, trunk isn't compiling before running <G>.
> >>
> >> Here's the server definition:
> >> new StreamingUpdateSolrServer(url, 10, 4);
> >>
> >> and I'm batching up the documents and sending them to Solr in batches
of
> 1,000.
> >>
> >> So, my question is whether this should be pursued. Note that I'm still
> >> getting around 3K docs/second, which I can't complain about. Not that
> >> that stops me, you understand. And in return for a memory footprint
> >> reduction from 389M to 90M after some off-the-wall sorting and
> >> faceting I'll take it!
> >>
> >> Hmmmm, speaking of which, the memory usage changes seem like a
> good
> >> candidate for a page on the Wiki, anyone want to suggest a home?
> >>
> >>
> >> Solr 1.4.1
> >> Total Time Taken-> 257 seconds
> >> Total documents added-> 1917728
> >> Docs/sec-> 7461
> >> starting optimize
> >> optimizing took 26 seconds
> >>
> >> Solr 3.2
> >> Total Time Taken-> 243 seconds
> >> Total documents added-> 1917728
> >> Docs/sec-> 7891
> >> starting optimize
> >> optimizing took 21 seconds
> >>
> >> Solr 3x
> >> Total Time Taken-> 269 seconds
> >> Total documents added-> 1917728
> >> Docs/sec-> 7129
> >> starting optimize
> >> optimizing took 21 seconds
> >>
> >> Solr trunk. 2011-6-11: 17:24 EST
> >> Total Time Taken-> 592 seconds
> >> Total documents added-> 1917728
> >> Docs/sec-> 3239
> >> starting optimize
> >> optimizing took 159 seconds
> >>
> >> What do folks think? Is there anything I can/should do to narrow this
> down?
> >
> > Hi Eric,
> >
> > this looks weird, I have some questions:
> >
> > - you are indexing into the same disk as you read the data from?
> > - what are you rambuffer settings?
> > - how many threads are you using to send data to solr?
> > - what is your autocommit setting?
> >
> > simon
> >
> >
> >>
> >> Erick
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to