Yep, the timings posted were the best time out of 10 consecutive runs.
The profiling was done in the middle of 1000 consecutive iterations,
specifically to exclude any warm-up time.

The sort of data we're storing in the field is quite possibly a
worst-case scenario for the compression. The data is mixed digest info
like

"MD5;[md5 value in hex]"
"SHA-1;[sha1 value in hex]"
"SHA-256;[sha256 value in hex]"
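
For concreteness, a sketch of how terms in that shape could be produced
(the prefix labels and lowercase-hex encoding match the examples above;
the actual indexing code is ours and isn't shown here):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestTerms {
    /** Builds an index term of the form "ALGO;hexdigest". */
    static String digestTerm(String algorithm, byte[] content)
            throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance(algorithm).digest(content);
        StringBuilder sb = new StringBuilder(algorithm).append(';');
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] content = "example".getBytes(StandardCharsets.UTF_8);
        // Every term for the same algorithm shares the same literal
        // prefix (e.g. "MD5;"), which is what the terms dictionary sees.
        System.out.println(digestTerm("MD5", content));
        System.out.println(digestTerm("SHA-256", content));
    }
}
```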

In fact, there's another field in the index which contains the same
MD5s without the common prefix, and the same sort of operation on that
field doesn't see the same slowdown. (It's a bit slower - maybe 5% or
so? Certainly nothing like 100%.) So at least for MD5s we have the
luxury of an alternative field to look up. For other digests, I'm
afraid we're stuck for now until we change how we index those.

What's ironic is that we originally put the prefix on to make seeking
to the values faster. ^^;;

TX


On Mon, 27 Jul 2020 at 17:08, Adrien Grand <jpou...@gmail.com> wrote:
>
> Alex, this issue you linked is about the terms dictionary of doc values.
> Trejkaz linked the correct issue which is about the terms dictionary of the
> inverted index.
>
> It's interesting you're seeing so much time spent in readVInt on 8.5 since
> there is a single vint that is read for each block in
> "LowercaseAsciiCompression.decompress". Are these relative timings
> consistent over multiple runs?
>
> On Mon, Jul 27, 2020 at 5:57 AM Alex K <aklib...@gmail.com> wrote:
>
> > Hi,
> >
> > Also have a look here:
> > https://issues.apache.org/jira/plugins/servlet/mobile#issue/LUCENE-9378
> >
> > Seems it might be related.
> > - Alex
> >
> > On Sun, Jul 26, 2020, 23:31 Trejkaz <trej...@trypticon.org> wrote:
> >
> > > Hi all.
> > >
> > > I've been tracking down slow seeking performance in TermsEnum after
> > > updating to Lucene 8.5.1.
> > >
> > > On 8.5.1:
> > >
> > >     SegmentTermsEnum.seekExact: 33,829 ms (70.2%) (remaining time in our
> > > code)
> > >         SegmentTermsEnumFrame.loadBlock: 29,104 ms (60.4%)
> > >             CompressionAlgorithm$2.read: 25,789 ms (53.5%)
> > >                 LowercaseAsciiCompression.decompress: 25,789 ms (53.5%)
> > >                     DataInput.readVInt: 24,690 ms (51.2%)
> > >         SegmentTermsEnumFrame.scanToTerm: 2,921 ms (6.1%)
> > >
> > > On 7.7.0 (previous version we were using):
> > >
> > >     SegmentTermsEnum.seekExact: 5,897 ms (43.7%) (remaining time in our
> > > code)
> > >         SegmentTermsEnumFrame.loadBlock: 3,499 ms (25.9%)
> > >             BufferedIndexInput.readBytes: 1,500 ms (11.1%)
> > >             DataInput.readVInt: 1,108 ms (8.2%)
> > >         SegmentTermsEnumFrame.scanToTerm: 1,501 ms (11.1%)
> > >
> > > So on the surface it sort of looks like the new version spends less
> > > time scanning and much more time loading blocks to decompress?
> > >
> > > Looking for some clues to what might have changed here, and whether
> > > it's something we can avoid, but currently LUCENE-4702 looks like it
> > > may be related.
> > >
> > > TX
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
> >
>
>
> --
> Adrien
