Hi Adrien,

Thanks for the response. Good points too!

We actually went with a smallish benchmark to be able to profile the
application within a reasonable time.

We will do a larger benchmark (say, 1M documents, without profiling) and I
will revisit the commit code as well. (IIRC we actually increased the
commit frequency a while back because of issues, maybe out-of-memory
issues. That was in the Lucene 4.x days, so it might no longer be relevant.)
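
For reference, a time-based policy along the lines you suggested could look
roughly like the sketch below. This is not our actual code: `TimedCommitter`
and its members are hypothetical names, and the commit action is passed in as
a plain `Runnable` (in our application it would wrap a call to
`IndexWriter.commit()`).

```java
import java.util.function.LongSupplier;

/** Hypothetical helper: runs a commit action at most once per interval. */
final class TimedCommitter {
    private final long intervalMillis;
    private final LongSupplier clock;    // injectable clock, e.g. System::currentTimeMillis
    private final Runnable commitAction; // assumed to wrap the real commit call and its checked exceptions
    private long lastCommitMillis;
    private int commits;

    TimedCommitter(long intervalMillis, LongSupplier clock, Runnable commitAction) {
        this.intervalMillis = intervalMillis;
        this.clock = clock;
        this.commitAction = commitAction;
        this.lastCommitMillis = clock.getAsLong();
    }

    /** Call after each added document; commits only if the interval has elapsed. */
    boolean maybeCommit() {
        long now = clock.getAsLong();
        if (now - lastCommitMillis < intervalMillis) {
            return false;
        }
        commitAction.run();
        lastCommitMillis = now;
        commits++;
        return true;
    }

    /** Number of commits triggered so far. */
    int commitCount() {
        return commits;
    }
}
```

With a 10-minute interval this would be constructed as
`new TimedCommitter(600_000L, System::currentTimeMillis, action)` and
`maybeCommit()` called from the indexing loop.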

What I don't understand yet is how this difference (between 6.6 and 7.1) came
to be. I was reading the change log but could not really pinpoint it. Sure,
the commits are far from optimal, but we use the same commit strategy
in 6.6 and 7.1.
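
To compare the number of checksum calls outside the profiler, something like
the following wrapper could be used. It is only a sketch: `CountingChecksum`
is a hypothetical class, and it says nothing about how Lucene itself computes
checksums. It just counts `update` calls on any `java.util.zip.Checksum`.

```java
import java.util.zip.CRC32;
import java.util.zip.Checksum;

/** Hypothetical wrapper that counts update() calls on a delegate checksum. */
final class CountingChecksum implements Checksum {
    private final Checksum delegate;
    private long updateCalls;

    CountingChecksum(Checksum delegate) {
        this.delegate = delegate;
    }

    @Override
    public void update(int b) {
        updateCalls++;
        delegate.update(b);
    }

    @Override
    public void update(byte[] b, int off, int len) {
        updateCalls++;
        delegate.update(b, off, len);
    }

    @Override
    public long getValue() {
        return delegate.getValue();
    }

    @Override
    public void reset() {
        delegate.reset();
    }

    /** Number of update() calls seen since construction. */
    long updateCalls() {
        return updateCalls;
    }
}
```

Feeding the same bytes in one large call or in many small calls (for example
through `new CountingChecksum(new CRC32())`) gives the same checksum value but
very different call counts, which is the kind of difference we see between
the two versions.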

-Rob




On Wed, Jan 31, 2018 at 1:56 PM, Adrien Grand <jpou...@gmail.com> wrote:

> Hi Rob,
>
> I don't think your benchmark is good. If I read it correctly, it only
> indexes between 21k and 22k documents, which is tiny. Plus it should try to
> better replicate production workload, otherwise we will draw wrong
> conclusions.
>
> I also suspect something is not quite right in your indexing code. When I
> look at the IW logs, 562 out of the 642 flushes only write 1 document. I'm
> not surprised that it exacerbates the cost of checksums, which are cheaper
> to compute on one large file than on many tiny files. For the record, even
> committing every 5k documents still sounds too frequent to me for an
> application that is heavily indexing. Maybe you should consider moving to a
> time-based policy, e.g. committing every 10 minutes?
>
> On Wed, Jan 31, 2018 at 10:25, Rob Audenaerde <rob.audenae...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > We ran the benchmarks (6.6 vs 7.1) with the IW info stream enabled. As
> > attachments cannot be too large, I uploaded the logs to Google Drive. They
> > can be found here:
> >
> > https://drive.google.com/open?id=1-nAHgpPO3qZ78lnvvlQ0_lF4uHJ-cWLh
> >
> > Thanks in advance,
> > -Rob
> >
> > On Mon, Jan 29, 2018 at 1:08 PM, Rob Audenaerde <rob.audenae...@gmail.com>
> > wrote:
> >
> > > Hi Uwe,
> > >
> > > Thanks for the reply. We commit often. Actually, in the benchmark, we
> > > commit every 60 documents (but we will run a larger set with fewer
> > > commits). The number of commits we call does not change between 6.6 and
> > > 7.1. In our production systems we commit every 5000 documents.
> > >
> > > We dug deeper into the commit methods, and the main difference currently
> > > seems to be the calls to java.util.zip.Checksum.update(). The number of
> > > calls to that method is around 11M in 6.6 and 21M in 7.1, so almost twice
> > > as many.
> > >
> > > -Rob
> > >
> > > On Mon, Jan 29, 2018 at 12:18 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> > >
> > >> Hi,
> > >>
> > >> How often do you commit? If you index the data initially (that's the case
> > >> where indexing needs to be fast), one would call commit at the end of the
> > >> whole job, so the actual time it takes is not so important.
> > >>
> > >> If you have a system where the index is updated all the time, then of
> > >> course committing is also something you have to take into account. Systems
> > >> like Solr or Elasticsearch use a transaction log in parallel to indexing,
> > >> so they commit very seldom. If the system crashes, the changes since the
> > >> last commit are replayed from the transaction log.
> > >>
> > >> Uwe
> > >>
> > >> -----
> > >> Uwe Schindler
> > >> Achterdiek 19, D-28357 Bremen
> > >> http://www.thetaphi.de
> > >> eMail: u...@thetaphi.de
> > >>
> > >> > -----Original Message-----
> > >> > From: Rob Audenaerde [mailto:rob.audenae...@gmail.com]
> > >> > Sent: Monday, January 29, 2018 11:29 AM
> > >> > To: java-user@lucene.apache.org
> > >> > Subject: Re: indexing performance 6.6 vs 7.1
> > >> >
> > >> > Hi all,
> > >> >
> > >> > Some follow up (sorry for the delay).
> > >> >
> > >> > We built a benchmark in our application, and profiled it (on a smallish
> > >> > data set). What we currently see in the profiler is that in Lucene 7.1
> > >> > the calls to `commit()` take much longer.
> > >> >
> > >> > The self-time committing in 6.6: 3,215 ms
> > >> > The self-time committing in 7.1: 10,187 ms.
> > >> >
> > >> > We will try to run a larger data set and also later with the IW info
> > >> > stream.
> > >> >
> > >> > -Rob
> > >> >
> > >> > On Thu, Jan 18, 2018 at 7:03 PM, Erick Erickson <erickerick...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Robert:
> > >> > >
> > >> > > Ah, right. I keep confusing my gmail lists
> > >> > > "lucene dev"
> > >> > > and
> > >> > > "lucene list"....
> > >> > >
> > >> > > Siiigggghhhhh.
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Thu, Jan 18, 2018 at 9:18 AM, Adrien Grand <jpou...@gmail.com> wrote:
> > >> > > > If you have sparse data, I would have expected index time to *decrease*,
> > >> > > > not increase.
> > >> > > >
> > >> > > > Can you enable the IW info stream and share flush + merge times to see
> > >> > > > where indexing time goes?
> > >> > > >
> > >> > > > If you can run with a profiler, this might also give useful information.
> > >> > > >
> > >> > > > On Thu, Jan 18, 2018 at 11:23, Rob Audenaerde <rob.audenae...@gmail.com>
> > >> > > > wrote:
> > >> > > >
> > >> > > >> Hi all,
> > >> > > >>
> > >> > > >> We recently upgraded from Lucene 6.6 to 7.1. We see a significant drop in
> > >> > > >> indexing performance.
> > >> > > >>
> > >> > > >> We have an atypical use of Lucene, as we (also) index some database tables
> > >> > > >> and add all the values as AssociatedFacetFields as well. This allows us to
> > >> > > >> create pivot tables on search results really fast.
> > >> > > >>
> > >> > > >> These tables have some overlapping columns, but also disjoint ones.
> > >> > > >>
> > >> > > >> We anticipated a decrease in index size because of the sparse docvalues. We
> > >> > > >> see this happening, with decreases to ~50%-80% of the original index size.
> > >> > > >> But we did not expect a drop in indexing performance (client systems'
> > >> > > >> indexing time increased by 50% to 250%).
> > >> > > >>
> > >> > > >> (Our indexing speed used to be mainly bound by the speed at which the
> > >> > > >> Taxonomy could deliver new ordinals for new values. We are currently
> > >> > > >> investigating whether this is still the case, and will report later when a
> > >> > > >> profiler run has been done.)
> > >> > > >>
> > >> > > >> Does anyone know if this increase in indexing time is to be expected as a
> > >> > > >> result of the sparse docvalues change?
> > >> > > >>
> > >> > > >> Kind regards,
> > >> > > >>
> > >> > > >> Rob Audenaerde
> > >> > > >>
> > >> > >
> > >> > >
> > >> > > ---------------------------------------------------------------------
> > >> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > >> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >> > >
> > >> > >
> > >
> >
>