Hi Rob,

I don't think your benchmark is representative. If I read it correctly, it
only indexes between 21k and 22k documents, which is tiny. It should also
replicate the production workload more closely, otherwise we will draw the
wrong conclusions.

I also suspect something is not quite right in your indexing code. When I
look at the IW logs, 562 out of the 642 flushes only write 1 document. I'm
not surprised that it exacerbates the cost of checksums, which are cheaper
to compute on one large file than on many tiny files. For the record, even
committing every 5k documents still sounds too frequent to me for an
application that indexes heavily. Maybe you should consider moving to a
time-based policy, e.g. commit every 10 minutes?
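
To make the time-based idea concrete, here is a minimal sketch, not a
definitive implementation. The names TimedCommitter, CommitAction and
markDirty are hypothetical and only for illustration; the commit action
stands in for whatever your application calls today (for example
writer::commit on your IndexWriter). Indexing threads only flag that there
are pending changes, and a background timer does the actual commit.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper: commits on a timer instead of every N documents. */
final class TimedCommitter implements AutoCloseable {

    /** Stand-in for the real commit call (assumption: e.g. writer::commit). */
    interface CommitAction {
        void commit() throws Exception;
    }

    private final CommitAction action;
    private final AtomicBoolean dirty = new AtomicBoolean(false);
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    TimedCommitter(CommitAction action, long interval, TimeUnit unit) {
        this.action = action;
        // Fixed delay: the next commit is scheduled only after the previous one ends.
        scheduler.scheduleWithFixedDelay(this::commitIfDirty, interval, interval, unit);
    }

    /** Call after each add/update; it only sets a flag and never commits itself. */
    void markDirty() {
        dirty.set(true);
    }

    /** Commits only if something changed since the previous successful commit. */
    synchronized void commitIfDirty() {
        if (dirty.compareAndSet(true, false)) {
            try {
                action.commit();
            } catch (Exception e) {
                // Keep the flag set so the next tick retries; swallowing the
                // exception here keeps the periodic task scheduled.
                dirty.set(true);
                System.err.println("commit failed, will retry: " + e);
            }
        }
    }

    /** Stops the timer, then performs one last commit for any pending changes. */
    @Override
    public void close() throws InterruptedException {
        scheduler.shutdown();
        scheduler.awaitTermination(1, TimeUnit.MINUTES);
        commitIfDirty();
    }
}
```

With something like new TimedCommitter(writer::commit, 10, TimeUnit.MINUTES),
the indexing loop would call markDirty() after each document and close() on
shutdown, so the number of commits depends on elapsed time rather than on
document count.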

On Wed, Jan 31, 2018 at 10:25, Rob Audenaerde <rob.audenae...@gmail.com>
wrote:

> Hi all,
>
> We ran the benchmarks (6.6 vs 7.1) with the IW info stream enabled and, as
> attachments cannot be too large, uploaded the logs to Google Drive. They
> can be found here:
>
> https://drive.google.com/open?id=1-nAHgpPO3qZ78lnvvlQ0_lF4uHJ-cWLh
>
> Thanks in advance,
> -Rob
>
> On Mon, Jan 29, 2018 at 1:08 PM, Rob Audenaerde <rob.audenae...@gmail.com>
> wrote:
>
> > Hi Uwe,
> >
> > Thanks for the reply. We commit often. Actually, in the benchmark, we
> > commit every 60 documents (but we will run a larger set with fewer
> > commits). The number of commits we call does not change between 6.6 and
> > 7.1. In our production systems we commit every 5000 documents.
> >
> > We dug deeper into the commit methods, and the main difference currently
> > seems to be the calls to java.util.zip.Checksum.update(). The number of
> > calls to that method is around 11M in 6.6 and 21M in 7.1, so almost twice
> > as many.
> >
> > -Rob
> >
> > On Mon, Jan 29, 2018 at 12:18 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> >
> >> Hi,
> >>
> >> How often do you commit? If you index the data initially (that's the
> >> case where indexing needs to be fast), one would call commit at the end
> >> of the whole job, so the actual time it takes is not so important.
> >>
> >> If you have a system where the index is updated all the time, then of
> >> course committing is also something you have to take into account.
> >> Systems like Solr or Elasticsearch use a transaction log in parallel to
> >> indexing, so they commit very seldom. If the system crashes, the changes
> >> since the last commit are replayed from the transaction log.
> >>
> >> Uwe
> >>
> >> -----
> >> Uwe Schindler
> >> Achterdiek 19, D-28357 Bremen
> >> http://www.thetaphi.de
> >> eMail: u...@thetaphi.de
> >>
> >> > -----Original Message-----
> >> > From: Rob Audenaerde [mailto:rob.audenae...@gmail.com]
> >> > Sent: Monday, January 29, 2018 11:29 AM
> >> > To: java-user@lucene.apache.org
> >> > Subject: Re: indexing performance 6.6 vs 7.1
> >> >
> >> > Hi all,
> >> >
> >> > Some follow-up (sorry for the delay).
> >> >
> >> > We built a benchmark in our application, and profiled it (on a
> >> > smallish data set). What we currently see in the profiler is that in
> >> > Lucene 7.1 the calls to `commit()` take much longer.
> >> >
> >> > Self time of committing in 6.6: 3,215 ms
> >> > Self time of committing in 7.1: 10,187 ms
> >> >
> >> > We will try to run a larger data set, and later also run with the IW
> >> > info stream enabled.
> >> >
> >> > -Rob
> >> >
> >> > On Thu, Jan 18, 2018 at 7:03 PM, Erick Erickson
> >> > <erickerick...@gmail.com> wrote:
> >> >
> >> > > Robert:
> >> > >
> >> > > Ah, right. I keep confusing my gmail lists
> >> > > "lucene dev"
> >> > > and
> >> > > "lucene list"....
> >> > >
> >> > > Siiigggghhhhh.
> >> > >
> >> > >
> >> > >
> >> > > On Thu, Jan 18, 2018 at 9:18 AM, Adrien Grand <jpou...@gmail.com>
> >> > > wrote:
> >> > > > If you have sparse data, I would have expected index time to
> >> > > > *decrease*, not increase.
> >> > > >
> >> > > > Can you enable the IW info stream and share flush + merge times to
> >> > > > see where indexing time goes?
> >> > > >
> >> > > > If you can run with a profiler, this might also give useful
> >> > > > information.
> >> > > >
> >> > > > On Thu, Jan 18, 2018 at 11:23, Rob Audenaerde
> >> > > > <rob.audenae...@gmail.com> wrote:
> >> > > >
> >> > > >> Hi all,
> >> > > >>
> >> > > >> We recently upgraded from Lucene 6.6 to 7.1. We see a significant
> >> > > >> drop in indexing performance.
> >> > > >>
> >> > > >> We have an atypical use of Lucene, as we (also) index some database
> >> > > >> tables and add all the values as AssociatedFacetFields as well.
> >> > > >> This allows us to create pivot tables on search results really
> >> > > >> fast.
> >> > > >>
> >> > > >> These tables have some overlapping columns, but also disjoint ones.
> >> > > >>
> >> > > >> We anticipated a decrease in index size because of the sparse
> >> > > >> docvalues. We see this happening, with decreases to ~50%-80% of the
> >> > > >> original index size. But we did not expect a drop in indexing
> >> > > >> performance (indexing time on client systems increased by +50% to
> >> > > >> +250%).
> >> > > >>
> >> > > >> (Our indexing speed used to be mainly bound by the speed at which
> >> > > >> the Taxonomy could deliver new ordinals for new values. We are
> >> > > >> currently investigating whether this is still the case and will
> >> > > >> report back once a profiler run has been done.)
> >> > > >>
> >> > > >> Does anyone know if this increase in indexing time is to be
> >> > > >> expected as a result of the sparse docvalues change?
> >> > > >>
> >> > > >> Kind regards,
> >> > > >>
> >> > > >> Rob Audenaerde
> >> > > >>
> >> > >
> >> > >
> >> > > ---------------------------------------------------------------------
> >> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >> > >
> >> > >
> >>
> >>
> >
>
