Re: TermsEnum.seekExact degraded performance somewhere between Lucene 7.7.0 and 8.5.1.

Adrien Grand Thu, 06 Aug 2020 06:14:05 -0700

I'm puzzled. I would have expected the digest-no-prefix times to be faster
on JDK 14 than on your older JDK, for the same reason that digest-lower and
digest-upper got faster.


I wonder whether the no-prefix variant is faster partly because it is
better at identifying that the digest doesn't exist in the terms dictionary
thanks to the terms index, while the prefix variants might need to go to
the terms dict and decompress data more frequently. But if that were the
case, then you would already be seeing faster lookups on the no-prefix
variant on older versions of Lucene.

If this is an important use-case for you, one way to improve these times
would be to index these digests in their binary form rather than as hex
strings.
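
For illustration, here is a minimal sketch of what "binary form" could mean
for an MD5 digest: decoding the 32-character hex string into its 16 raw
bytes. The class and method names (DigestBytes, hexToBytes) are made up for
this example, and how the bytes get wired into your indexing code is an
assumption left to the comment at the end.

```java
public class DigestBytes {

    /**
     * Decodes a hex string into raw bytes, halving its length.
     * Hypothetical helper for illustration; accepts upper- or lowercase hex.
     */
    public static byte[] hexToBytes(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd-length hex string: " + hex);
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex string: " + hex);
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        // MD5 of the empty input, used here only as a well-known sample value.
        byte[] term = hexToBytes("d41d8cd98f00b204e9800998ecf8427e");
        System.out.println(term.length + " bytes"); // prints "16 bytes"

        // Assumption: with Lucene these bytes would be wrapped as
        // new BytesRef(term) instead of new BytesRef(prefix + hexString).
    }
}
```

The same conversion would have to be applied both when indexing and when
building the term passed to seekExact, otherwise the lookups cannot match.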

On Thu, Jul 30, 2020 at 9:01 AM Trejkaz <[email protected]> wrote:

> On Mon, 27 Jul 2020 at 19:24, Adrien Grand <[email protected]> wrote:
> >
> > It's interesting you're not seeing the same slowdown on the other field.
> > How hard would it be for you to test what the performance is if you
> > lowercase the name of the digest algorithms, i.e. "md5;[md5 value in hex]",
> > etc. The reason I'm asking is because the compression logic is optimized
> > for lowercase ASCII so removing uppercase letters would help remove the
> > need to encode exceptions, which is one reason I'm thinking why the
> > slowdown might be less on your other field.
>
> It took me a while to get some free time to make a new version of the
> test which doesn't have our own code in it so that I was able to add
> the new field without rewriting a large chunk of our system... but it
> looks like the timing for lowercase prefixes is around the same as
> upper.
>
> This particular test I've ended up doing though is a pathological
> case, as it turned out to have 0 hits in the index despite searching
> for 29 million digests.
>
> -------------------------
> Time for just reading the digest list
> Count = 29459432, time = 1946 ms
> Count = 29459432, time = 1752 ms
> Count = 29459432, time = 1752 ms
> -------------------------
> Times for digest-upper
> Count = 0, time = 40570 ms
> Count = 0, time = 42574 ms
> Count = 0, time = 40121 ms
> -------------------------
> Times for digest-lower
> Count = 0, time = 40462 ms
> Count = 0, time = 40319 ms
> Count = 0, time = 39938 ms
> -------------------------
> Times for digest-no-prefix
> Count = 0, time = 10936 ms
> Count = 0, time = 10857 ms
> Count = 0, time = 10628 ms
> -------------------------
>
> So about 4 times faster on the field with no term prefixes.
> The code for all 3 tests is shared:
>
>     private static void timeDigest(Path md5sFile, IndexReader reader,
>             String field, String termPrefix) throws IOException {
>         try (BufferedReader md5sReader = Files.newBufferedReader(md5sFile)) {
>             TermsEnum termsEnum = MultiTerms.getTerms(reader, field).iterator();
>             PostingsEnum postingsEnum = null;
>
>             long t0 = System.currentTimeMillis();
>             int hitCount = 0;
>
>             while (true) {
>                 String md5 = md5sReader.readLine();
>                 if (md5 == null) {
>                     break;
>                 }
>
>                 if (termsEnum.seekExact(new BytesRef(termPrefix + md5))) {
>                     postingsEnum = termsEnum.postings(postingsEnum, PostingsEnum.NONE);
>                     while (postingsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
>                         hitCount++;
>                     }
>                 }
>             }
>
>             long t1 = System.currentTimeMillis();
>             System.out.println("Count = " + hitCount + ", time = " + (t1 - t0) + " ms");
>         }
>     }
>
> > In case you're using an old JRE, you might want to try out with a JRE 13 or
> > more recent. Some of the logic in this lowercase ASCII compression only
> > gets vectorized on JDK13+.
>
> Times for JDK 14.0.2:
>
> -------------------------
> Times for just reading the digest list
> Count = 29459432, time = 2050 ms
> Count = 29459432, time = 2156 ms
> Count = 29459432, time = 1905 ms
> -------------------------
> Times for digest-upper
> Count = 0, time = 24336 ms
> Count = 0, time = 24236 ms
> Count = 0, time = 23986 ms
> -------------------------
> Times for digest-lower
> Count = 0, time = 24440 ms
> Count = 0, time = 23960 ms
> Count = 0, time = 23956 ms
> -------------------------
> Times for digest-no-prefix
> Count = 0, time = 13177 ms
> Count = 0, time = 13095 ms
> Count = 0, time = 13081 ms
> -------------------------
>
>
> Almost a 2:1 speed boost for prefixed timings just by updating the JDK...
>
> The non-prefixed timings seem to be about 20% slower than on JDK 8 (WTF?)
> but still win when compared to the prefixed timings alone.
>
> TX

-- 
Adrien
