Hi Adrien,

Thanks for the response. Good points too!

We actually went with a smallish benchmark so that we could profile the application within a reasonable time. We will run a larger benchmark (say, 1M documents, without profiling), and I will revisit the commit code as well. (IIRC we increased the commit frequency a while back because of issues, possibly out-of-memory issues; that was in the Lucene 4.x days, so it might no longer be relevant.)

What I don't understand yet is how this difference (between 6 and 7) came to be. I was reading the change log but could not really pinpoint it. Sure, the commits are far from optimal, but we use the same commit strategy in 6.6 and 7.1.

-Rob

On Wed, Jan 31, 2018 at 1:56 PM, Adrien Grand <jpou...@gmail.com> wrote:

> Hi Rob,
>
> I don't think your benchmark is good. If I read it correctly, it only
> indexes between 21k and 22k documents, which is tiny. Plus it should try to
> better replicate production workload, otherwise we will draw wrong
> conclusions.
>
> I also suspect something is not quite right in your indexing code. When I
> look at the IW logs, 562 out of the 642 flushes only write 1 document. I'm
> not surprised that it exacerbates the cost of checksums, which are cheaper
> to compute on one large file than on many tiny files. For the record, even
> committing every 5k documents still sounds too frequent to me for an
> application that is heavily indexing. Maybe you should consider moving to a
> time-based policy? e.g. commit every 10 minutes?
>
> On Wed, Jan 31, 2018 at 10:25 AM, Rob Audenaerde <rob.audenae...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > We ran the benchmarks (6.6 vs 7.1) with IW info stream and (as attachment
> > cannot be too large) I uploaded them to google drive.
> > They can be found here:
> >
> > https://drive.google.com/open?id=1-nAHgpPO3qZ78lnvvlQ0_lF4uHJ-cWLh
> >
> > Thanks in advance,
> > -Rob
> >
> > On Mon, Jan 29, 2018 at 1:08 PM, Rob Audenaerde <rob.audenae...@gmail.com>
> > wrote:
> >
> > > Hi Uwe,
> > >
> > > Thanks for the reply. We commit often. Actually, in the benchmark, we
> > > commit every 60 documents (but we will run a larger set with fewer
> > > commits). The number of commits we call does not change between 6.6 and
> > > 7.1. In our production systems we commit every 5000 documents.
> > >
> > > We dug deeper into the commit methods, and currently the main
> > > difference seems to be the calls to java.util.zip.Checksum.update().
> > > The number of calls to that method is around 11M in 6.6 and 21M in 7.1,
> > > so almost twice the calls.
> > >
> > > -Rob
> > >
> > > On Mon, Jan 29, 2018 at 12:18 PM, Uwe Schindler <u...@thetaphi.de>
> > > wrote:
> > >
> > >> Hi,
> > >>
> > >> How often do you commit? If you index the data initially (that's the
> > >> case where indexing needs to be fast), one would call commit at the end
> > >> of the whole job, so the actual time it takes is not so important.
> > >>
> > >> If you have a system where the index is updated all the time, then of
> > >> course committing is also something you have to take into account.
> > >> Systems like Solr or Elasticsearch use a transaction log in parallel to
> > >> indexing, so they commit very seldom. If the system crashes, the changes
> > >> are replayed from the transaction log since the last commit.
> > >>
> > >> Uwe
> > >>
> > >> -----
> > >> Uwe Schindler
> > >> Achterdiek 19, D-28357 Bremen
> > >> http://www.thetaphi.de
> > >> eMail: u...@thetaphi.de
> > >>
> > >> > -----Original Message-----
> > >> > From: Rob Audenaerde [mailto:rob.audenae...@gmail.com]
> > >> > Sent: Monday, January 29, 2018 11:29 AM
> > >> > To: java-user@lucene.apache.org
> > >> > Subject: Re: indexing performance 6.6 vs 7.1
> > >> >
> > >> > Hi all,
> > >> >
> > >> > Some follow up (sorry for the delay).
> > >> >
> > >> > We built a benchmark in our application, and profiled it (on a
> > >> > smallish data set). What we currently see in the profiler is that in
> > >> > Lucene 7.1 the calls to `commit()` take much longer.
> > >> >
> > >> > The self-time committing in 6.6: 3,215 ms
> > >> > The self-time committing in 7.1: 10,187 ms
> > >> >
> > >> > We will try to run a larger data set and also later with the IW info
> > >> > stream.
> > >> >
> > >> > -Rob
> > >> >
> > >> > On Thu, Jan 18, 2018 at 7:03 PM, Erick Erickson
> > >> > <erickerick...@gmail.com> wrote:
> > >> >
> > >> > > Robert:
> > >> > >
> > >> > > Ah, right. I keep confusing my gmail lists
> > >> > > "lucene dev"
> > >> > > and
> > >> > > "lucene list"....
> > >> > >
> > >> > > Siiigggghhhhh.
> > >> > >
> > >> > > On Thu, Jan 18, 2018 at 9:18 AM, Adrien Grand <jpou...@gmail.com>
> > >> > > wrote:
> > >> > > > If you have sparse data, I would have expected index time to
> > >> > > > *decrease*, not increase.
> > >> > > >
> > >> > > > Can you enable the IW info stream and share flush + merge times to
> > >> > > > see where indexing time goes?
> > >> > > >
> > >> > > > If you can run with a profiler, this might also give useful
> > >> > > > information.
> > >> > > >
> > >> > > > On Thu, Jan 18, 2018 at 11:23 AM, Rob Audenaerde
> > >> > > > <rob.audenae...@gmail.com> wrote:
> > >> > > >
> > >> > > >> Hi all,
> > >> > > >>
> > >> > > >> We recently upgraded from Lucene 6.6 to 7.1. We see a significant
> > >> > > >> drop in indexing performance.
> > >> > > >>
> > >> > > >> We have an atypical use of Lucene, as we (also) index some
> > >> > > >> database tables and add all the values as AssociatedFacetFields as
> > >> > > >> well. This allows us to create pivot tables on search results
> > >> > > >> really fast.
> > >> > > >>
> > >> > > >> These tables have some overlapping columns, but also disjoint ones.
> > >> > > >>
> > >> > > >> We anticipated a decrease in index size because of the sparse
> > >> > > >> docvalues. We see this happening, with decreases to ~50%-80% of
> > >> > > >> the original index size. But we did not expect a drop in indexing
> > >> > > >> performance (client systems' indexing time increased by +50% to
> > >> > > >> +250%).
> > >> > > >>
> > >> > > >> (Our indexing speed used to be mainly bound by the speed at which
> > >> > > >> the Taxonomy could deliver new ordinals for new values. We are
> > >> > > >> currently investigating whether this is still the case, and will
> > >> > > >> report later when a profiler run has been done.)
> > >> > > >>
> > >> > > >> Does anyone know if this increase in indexing time is to be
> > >> > > >> expected as a result of the sparse docvalues change?
> > >> > > >>
> > >> > > >> Kind regards,
> > >> > > >>
> > >> > > >> Rob Audenaerde
> > >> > >
> > >> > > ---------------------------------------------------------------------
> > >> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > >> > > For additional commands, e-mail: java-user-h...@lucene.apache.org